Learn how AI evals were critical to GitHub Copilot's early success.
35% off our upcoming evals course: https://maven.com/parlance-labs/evals?promoCode=hamel-ns4591
06:01 - Overview: how evaluations were central to Copilot's development and success.
06:12 - Introducing the Four Main Types of Evals: Algorithmic, Verifiable, LLM-as-Judge, and A/B Testing.
09:05 - Details on Harnesslib's goals, its process of collecting test samples, running tests against generated code, and its success rate criteria.
12:27 - Harnesslib lessons learned: key takeaways, including ensuring test content isn't in training data, staying consistent with production traffic, testing the entire system, and keeping the harness flexible.
15:56 - A/B Tests for Online Traffic: how the team ensured model/prompt changes were acceptable before full rollout.
17:51 - Key Metrics and Guardrail Metrics: Discussion of key metrics (acceptance rate, characters retained, latency) and a multitude of guardrail metrics.
22:07 - LLM-as-Judge: how the team used LLMs to make subjective quality judgments for evolving chat experiences, and how they transitioned from human baselines to fully LLM-as-Judge rubrics with specific criteria.
29:07 - Evolution of evals: the unsuitability of Harnesslib for new chat products, the product-building focus of evals, and the "who's judging the judge?" challenge.
38:55 - Algorithmic Tool Use Evaluation: ensuring the correct tools (functions) were being called by the LLM, and the utility of confusion matrices.
42:09 - Summary: John recaps where each of the four evaluation types was applied in Copilot's development.
43:22 - Q&A
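Harnesslib itself is internal to GitHub, but the verifiable-eval pattern described around 09:05 — generate code, run it against held-out tests, and track a pass rate — can be sketched roughly. Everything below (the `solution` function name, the sample tasks) is invented for illustration:

```python
# Minimal sketch of a verifiable code eval: run each candidate completion
# against held-out test cases and report the overall pass rate.
# All names are hypothetical; this is not Harnesslib's actual API.

def run_candidate(candidate_src: str, test_cases) -> bool:
    """Exec the generated code and check it against every test case."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # in a real harness, sandbox this
        fn = namespace["solution"]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # crashes, syntax errors, or missing names count as failures

def pass_rate(candidates, test_cases) -> float:
    results = [run_candidate(src, test_cases) for src in candidates]
    return sum(results) / len(results)

# Toy task: candidates must implement solution(a, b) == a + b.
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
candidates = [
    "def solution(a, b):\n    return a + b",    # correct
    "def solution(a, b):\n    return a - b",    # wrong answer
    "def solution(a, b):\n    return a + b +",  # syntax error
]
print(f"pass rate: {pass_rate(candidates, tests):.2f}")  # 1 of 3 passes
```

The "success rate criteria" mentioned in the talk would then be a threshold on this number, compared across model or prompt variants.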
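For the algorithmic tool-use evaluation at 38:55, the core idea is comparing the tool the model actually called against the tool a labeled example expected, then tabulating the results as a confusion matrix so systematic mix-ups between specific tools stand out. A hedged sketch (the tool names and labeled data are invented):

```python
from collections import Counter

# Hypothetical labeled examples: (expected_tool, tool_the_model_called).
calls = [
    ("search_code", "search_code"),
    ("search_code", "read_file"),
    ("read_file", "read_file"),
    ("run_tests", "run_tests"),
    ("run_tests", "search_code"),
]

# Confusion matrix: rows = expected tool, columns = tool actually called.
matrix = Counter(calls)
tools = sorted({t for pair in calls for t in pair})

print(" " * 14 + "".join(f"{t:>14}" for t in tools))
for expected in tools:
    row = "".join(f"{matrix[(expected, called)]:>14}" for called in tools)
    print(f"{expected:<14}{row}")

# Off-diagonal cells reveal which tools the model confuses with which.
accuracy = sum(exp == got for exp, got in calls) / len(calls)
print(f"accuracy: {accuracy:.2f}")  # 3 of 5 calls picked the expected tool
```

Because both the expected and actual tool are exact strings, this eval is fully algorithmic — no judge model or human rater is needed.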