Shreya Shankar and Hamel Husain discuss common mistakes people make when creating domain-specific evals.
LLM Evals Course for Engineers (35% Discount): http://bit.ly/eval-discount
00:51 Foundation model benchmarks are not the same as your application evals
03:00 Generic evals are useless
04:00 Do not outsource labeling & prompting to non-domain experts
09:28 You should make your own data annotation app
12:40 Your LLM prompts should be specific and grounded in error analysis
15:25 Use binary labels
18:57 Look at your data
23:41 Be careful of overfitting to test data
25:40 Run online tests