
How to Evaluate LLM Performance for Domain-Specific Use Cases

Snorkel AI

LLM evaluation is critical for generative AI in the enterprise, but measuring how well an LLM answers questions or performs tasks is difficult. LLM evaluations must therefore go beyond standard measures of "correctness" to include a more nuanced and granular view of quality. In practice, common approaches to enterprise LLM evaluation, such as OSS benchmarks, often come up short because they are slow, expensive, subjective, and incomplete. That leaves AI initiatives blocked, with no clear path to production quality.

In this video, Vincent Sunn Chen, Founding Engineer at Snorkel AI, and Rebekah Westerlind, Software Engineer at Snorkel AI, discuss the importance of LLM evaluation, highlight common challenges and approaches, and explain the core concepts behind Snorkel AI's approach to data-centric LLM evaluation.

In this video, you'll learn more about:
* Understanding the nuances of LLM evaluation.
* Evaluating LLM response accuracy at scale.
* Identifying where additional LLM fine-tuning is needed.

See more videos from Snorkel AI here: youtube.com/channel/UC6MQ2p8gZFYdTLEV8cysE6Q?sub_confirmation=1

Learn more about LLM evaluation here: https://snorkel.ai/llm-evaluation-primer/

Timestamps:
01:07 Agenda
01:40 Why do we need LLM evaluation?
02:55 Common evaluation axes
04:05 Why eval is more critical in Gen AI use cases
05:55 Why enterprises are often blocked on effective LLM evaluation
07:30 Common approaches to LLM evaluation
08:30 OSS benchmarks + metrics
09:40 LLM-as-a-judge
11:20 Annotation strategies
12:50 How can we do better than manual annotation strategies?
16:00 How data slices enable better LLM evaluation
18:00 How does LLM eval work with Snorkel?
20:45 Building a quality model
24:10 Using fine-grained benchmarks for next steps
25:50 Workflow overview (review)
26:45 Workflow—starting with the model
28:08 Workflow—using an LLM as a judge
28:40 Workflow—the quality model
30:00 Chatbot demo
31:46 Annotating data in Snorkel Flow (demo)
34:49 Building labeling functions in Snorkel Flow (demo)
40:15 LLM evaluation in Snorkel Flow (demo)
41:58 Snorkel Flow Jupyter notebook demo
44:28 Data slices in Snorkel Flow (demo)
46:51 Recap
49:25 Snorkel eval offer!
50:31 Q&A

#enterpriseai #largelanguagemodels #evaluation
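
To make the LLM-as-a-judge approach covered in the video more concrete, here is a minimal Python sketch of the general idea, scoring responses along several quality axes rather than a single "correct/incorrect" label. It is not Snorkel Flow's API; call_llm, the prompt, and the axis names are illustrative placeholders you would replace with your own judge model and evaluation criteria.

    # Minimal LLM-as-a-judge sketch (illustrative only, not Snorkel Flow's API).
    # `call_llm` is a placeholder for whatever client you use to reach a judge model.
    import json

    JUDGE_PROMPT = """You are grading a chatbot answer.
    Question: {question}
    Answer: {answer}
    Rate the answer on each axis from 1 (poor) to 5 (excellent)
    and reply with JSON: {{"correctness": int, "completeness": int, "tone": int}}"""

    def call_llm(prompt: str) -> str:
        """Placeholder: send `prompt` to your judge model and return its text reply."""
        raise NotImplementedError

    def judge(question: str, answer: str) -> dict:
        """Score one response along several quality axes using the judge model."""
        reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
        return json.loads(reply)  # in practice, validate or repair the JSON before trusting it

    def evaluate(dataset: list[dict]) -> dict:
        """Average the per-axis judge scores over an evaluation set of {question, answer} rows."""
        scores = [judge(row["question"], row["answer"]) for row in dataset]
        axes = scores[0].keys()
        return {axis: sum(s[axis] for s in scores) / len(scores) for axis in axes}

As the video discusses, judge scores like these are a starting point: they still need to be checked against human annotations and broken down by data slices to see where the model actually falls short.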
