Evaluating LLMs with Chatbot Arena and Joseph E. Gonzalez of RunLLM

Weights & Biases 1,128 4 months ago

In this episode of Gradient Dissent, Joseph E. Gonzalez, EECS Professor at UC Berkeley and Co-Founder at RunLLM, joins host Lukas Biewald to explore approaches to evaluating LLMs. They discuss vibes-based evaluation, which examines not just the accuracy of model responses but also their style and tone, and how Chatbot Arena has become a community-driven benchmark for open-source and commercial LLMs. Joseph shares insights on democratizing model evaluation, refining AI-human interactions, and using human preferences to improve model performance.

⏳ Timestamps:
[00:00] Introduction
[00:57] Research Highlights
[03:12] Evaluating "Vibes" in LLMs
[09:12] Conciseness vs. Accuracy
[16:18] Chatbot Arena: Origins and Evolution
[19:02] Understanding Style in LLM Responses
[23:04] Challenges with Theory of Mind in Multi-Agent Systems
[26:22] LLMs as Judges: Strengths and Biases
[33:54] Table-Augmented Generation (TAG)
[38:39] Reducing Hallucinations in LLMs
[43:50] Multi-Agent Collaboration
[46:17] Model Routing and RunLLM
[52:59] Reflections on Product Development vs. Research

🎙 Get our podcasts on these platforms:
Apple Podcasts: http://wandb.me/apple-podcasts
Spotify: http://wandb.me/spotify
Google: http://wandb.me/gd_google
YouTube: http://wandb.me/youtube

Follow Weights & Biases:
https://twitter.com/weights_biases
https://www.linkedin.com/company/wandb

Join the Weights & Biases Discord Server:
https://discord.gg/CkZKRNnaf3
