Learn how AI evals were critical to GitHub Copilot's early success.
35% off our upcoming evals course: https://maven.com/parlance-labs/evals?promoCode=hamel-ns4591
06:01 - Overview: how evaluations were central to Copilot's development and success.
06:12 - Introducing the Four Main Types of Evals: Algorithmic, Verifiable, LLM-as-Judge, and A/B Testing.
09:05 - Details on Harnesslib's goals, its process of collecting test samples, running tests against generated code, and its success rate criteria.
12:27 - Harnesslib lessons learned: key takeaways, including ensuring test content isn't in training data, staying consistent with production traffic, testing the entire system, and keeping the harness flexible.
15:56 - A/B Tests for Online Traffic: how the team ensured model/prompt changes were acceptable before full rollout.
17:51 - Key Metrics and Guardrail Metrics: Discussion of key metrics (acceptance rate, characters retained, latency) and a multitude of guardrail metrics.
22:07 - LLM-as-Judge: how the team used LLMs to make subjective quality judgments for evolving chat experiences, and how they transitioned from human baselines to fully LLM-as-Judge rubrics with specific criteria.
29:07 - Evolution of evals: the unsuitability of Harnesslib for new chat products, the product-building focus of evals, and the "who's judging the judge?" challenge.
38:55 - Algorithmic Tool Use Evaluation: ensuring the correct tools (functions) were being called by the LLM, and the utility of confusion matrices.
42:09 - Summary: John recaps where each of the four evaluation types was applied in Copilot's development.
43:22 - Q&A
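Harnesslib itself is internal to GitHub, but the verifiable-eval pattern described around 09:05 — generate code, run it against held-out tests, and track a pass rate — can be sketched roughly. Everything below (the `solution` function name, the sample tasks) is invented for illustration:

```python
# Minimal sketch of a verifiable code eval: run each candidate completion
# against held-out test cases and report the overall pass rate.
# All names are hypothetical; this is not Harnesslib's actual API.

def run_candidate(candidate_src: str, test_cases) -> bool:
    """Exec the generated code and check it against every test case."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # in a real harness, sandbox this
        fn = namespace["solution"]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # crashes, syntax errors, or missing names count as failures

def pass_rate(candidates, test_cases) -> float:
    results = [run_candidate(src, test_cases) for src in candidates]
    return sum(results) / len(results)

# Toy task: candidates must implement solution(a, b) == a + b.
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
candidates = [
    "def solution(a, b):\n    return a + b",    # correct
    "def solution(a, b):\n    return a - b",    # wrong answer
    "def solution(a, b):\n    return a + b +",  # syntax error
]
print(f"pass rate: {pass_rate(candidates, tests):.2f}")  # 1 of 3 passes
```

The "success rate criteria" mentioned in the talk would then be a threshold on this number, compared across model or prompt variants.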
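For the algorithmic tool-use evaluation at 38:55, the core idea is comparing the tool the model actually called against the tool a labeled example expected, then tabulating the results as a confusion matrix so systematic mix-ups between specific tools stand out. A hedged sketch (the tool names and labeled data are invented):

```python
from collections import Counter

# Hypothetical labeled examples: (expected_tool, tool_the_model_called).
calls = [
    ("search_code", "search_code"),
    ("search_code", "read_file"),
    ("read_file", "read_file"),
    ("run_tests", "run_tests"),
    ("run_tests", "search_code"),
]

# Confusion matrix: rows = expected tool, columns = tool actually called.
matrix = Counter(calls)
tools = sorted({t for pair in calls for t in pair})

print(" " * 14 + "".join(f"{t:>14}" for t in tools))
for expected in tools:
    row = "".join(f"{matrix[(expected, called)]:>14}" for called in tools)
    print(f"{expected:<14}{row}")

# Off-diagonal cells reveal which tools the model confuses with which.
accuracy = sum(exp == got for exp, got in calls) / len(calls)
print(f"accuracy: {accuracy:.2f}")  # 3 of 5 calls picked the expected tool
```

Because both the expected and actual tool are exact strings, this eval is fully algorithmic — no judge model or human rater is needed.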