Doing LLM evaluation right is crucial, but very challenging! We'll cover the basics of how LLM evaluation can be performed and many (but not all) of the ways it can go wrong. We'll also discuss tools available to make life easier, including the LM Evaluation Harness, along with domain-specific use cases.
Resources, links and other info available here: https://parlance-labs.com/education/evals/schoelkopf.html
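As a rough illustration of the tooling discussed in the talk, here is a minimal sketch of invoking the LM Evaluation Harness from Python. Task names, flags, and function signatures vary across harness versions, so treat the model and task choices below as placeholders rather than a pinned recipe.

```python
# Sketch: running a couple of benchmark tasks with the LM Evaluation Harness.
# Exact task names and signatures vary by harness version; this is illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face causal LM backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["lambada_openai", "hellaswag"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics such as accuracy or perplexity
```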
*00:00 Introduction to LLM Evaluation Deep Dive*
The complexities of LLM evaluation, including EleutherAI's contributions to open-source AI and model evaluation, and the use and evolution of the LM Evaluation Harness.
*01:49 Scoring Challenges in LLM Evaluation*
The complexities of accurately scoring LLMs, particularly when evaluating natural language responses to factual queries, and the importance of robust evaluation techniques.
*05:35 Log-likelihood Evaluation*
Insights into log-likelihood evaluation techniques, generating next-word probabilities in sequence models, and how the autoregressive transformer architecture aids in training and evaluation, including practical aspects of using log-likelihoods.
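To make the log-likelihood approach concrete, here is a rough sketch (using Hugging Face transformers directly, not the harness's internal code) of scoring a continuation under a causal LM: concatenate context and continuation, run one forward pass, and sum the log-probabilities of the continuation tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough sketch of log-likelihood scoring; the harness's own implementation
# handles batching, truncation, and tokenization edge cases more carefully.
tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m").eval()

def continuation_logprob(context: str, continuation: str) -> float:
    ctx_ids = tok(context, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits          # (1, seq_len, vocab)
    # Logits at position i predict token i+1, so shift by one.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions whose target token belongs to the continuation.
    n_cont = cont_ids.shape[1]
    return token_logprobs[0, -n_cont:].sum().item()

print(continuation_logprob("The capital of France is", " Paris"))
```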
*13:53 Multiple Choice Evaluation and Downstream Concerns*
The benefits and limitations of multiple choice evaluations for LLMs, including their simplicity and cost-effectiveness compared to long-form generation, and the necessity of aligning evaluation strategies with practical use cases.
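For illustration, a multiple-choice item can be scored by computing the log-likelihood of each answer choice given the question and picking the highest-scoring option, optionally length-normalized. This sketch reuses the illustrative `continuation_logprob` helper from above; it is not the harness's actual API.

```python
# Sketch of multiple-choice scoring by comparing per-option log-likelihoods.
# Reuses the illustrative continuation_logprob() helper sketched earlier.
def pick_choice(question: str, choices: list[str]) -> int:
    scores = [continuation_logprob(question, " " + c) for c in choices]
    # Length-normalizing these scores (e.g. dividing by answer length) reduces
    # the bias toward shorter answers; the harness reports both variants.
    return max(range(len(scores)), key=scores.__getitem__)

q = "Q: What is the boiling point of water at sea level?\nA:"
print(pick_choice(q, ["100 degrees Celsius", "0 degrees Celsius"]))  # expect index 0
```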
*18:46 Perplexity Evaluation*
Perplexity as a measure of model performance, the process for calculating perplexity, its utility and limitations, and how different tokenizers can impact model comparability.
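In brief, perplexity is the exponentiated average negative log-likelihood per unit of text. A small sketch of the calculation, and of why the choice of unit matters when comparing models with different tokenizers (the token log-probabilities would come from a forward pass like the one sketched earlier):

```python
import math

# Sketch: perplexity = exp of the average negative log-likelihood.
def perplexity(token_logprobs: list[float], num_units: int | None = None) -> float:
    # Dividing by the token count gives per-token perplexity; dividing by word
    # or byte count instead makes models with different tokenizers comparable.
    n = num_units if num_units is not None else len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

print(perplexity([-2.3, -0.7, -1.1]))               # per-token perplexity
print(perplexity([-2.3, -0.7, -1.1], num_units=2))  # e.g. per-word perplexity
```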
*22:44 Text Generation Evaluation*
The challenges of evaluating text generation, including difficulties in scoring free-form natural language and the impact of tokenization on evaluation results, and the importance of careful evaluation setup to avoid biased outcomes.
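One reason free-form generation is hard to score: even a simple exact-match metric hinges on how the generated answer is normalized. A toy sketch of that kind of post-processing (illustrative only, not the harness's actual answer extraction):

```python
import re
import string

# Toy sketch of exact-match scoring for generated answers; small changes to
# this normalization (articles, punctuation, whitespace) can visibly shift
# reported scores, which is one reason generation metrics are brittle.
def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)            # drop articles
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())                          # collapse whitespace

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

print(exact_match("Paris!", "paris"))                # True after normalization
print(exact_match("The answer is Paris.", "paris"))  # False: extra words remain
```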
*27:40 Importance of Transparency and Reproducibility in Evaluations*
The importance of transparency and reproducibility in LLM evaluations, the challenges of achieving reproducible results, and the need for detailed reporting and sharing of evaluation methodologies and code.
*38:23 Audience Q&A*
Practical advice and broader conceptual discussion from the Q&A session, addressing questions about specific evaluation frameworks and the effectiveness and limitations of current LLM evaluation methods.