RLHF & DPO Explained (In Simple Terms!)

Entry Point AI · 8,616 views · 10 months ago

Learn how Reinforcement Learning from Human Feedback (RLHF) actually works and why Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are changing the game.

This video doesn't go deep on math. Instead, I provide a high-level overview of each technique to help you make practical decisions about where to focus your time and energy.

0:52 The Idea of Reinforcement Learning
1:55 Reinforcement Learning from Human Feedback (RLHF)
4:21 RLHF in a Nutshell
5:06 RLHF Variations
6:11 Challenges with RLHF
7:02 Direct Preference Optimization (DPO)
7:47 Preferences Dataset Example
8:29 DPO in a Nutshell
9:25 DPO Advantages over RLHF
10:32 Challenges with DPO
10:50 Kahneman-Tversky Optimization (KTO)
11:39 Prospect Theory
13:35 Sigmoid vs Value Function
13:49 KTO Dataset
15:28 KTO in a Nutshell
15:54 Advantages of KTO
18:03 KTO Hyperparameters

These are the three papers referenced in the video:

1. Deep reinforcement learning from human preferences (https://arxiv.org/abs/1706.03741)
2. Direct Preference Optimization: Your Language Model is Secretly a Reward Model (https://arxiv.org/abs/2305.18290)
3. KTO: Model Alignment as Prospect Theoretic Optimization (https://arxiv.org/abs/2402.01306)

The Hugging Face TRL library offers implementations of PPO, DPO, and KTO:
https://huggingface.co/docs/trl/main/en/kto_trainer
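For orientation, here is a minimal sketch of what a KTO fine-tuning run with TRL might look like. The model name and dataset rows are placeholders, and the exact trainer arguments (for example, `processing_class` vs. the older `tokenizer` keyword) can differ between TRL versions, so check the docs linked above before running it.

```python
# Minimal KTO fine-tuning sketch with TRL (argument names may vary by TRL version).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

model_name = "gpt2"  # placeholder; substitute your own base or SFT model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# KTO uses unpaired examples: each row is a prompt, a completion,
# and a binary label marking that completion as desirable or undesirable.
train_dataset = Dataset.from_dict({
    "prompt": ["What is the capital of France?", "What is the capital of France?"],
    "completion": ["Paris.", "I don't know."],
    "label": [True, False],
})

training_args = KTOConfig(
    output_dir="kto-example",
    beta=0.1,                 # strength of the KL penalty toward the reference model
    desirable_weight=1.0,     # loss weight on desirable examples
    undesirable_weight=1.0,   # loss weight on undesirable examples
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = KTOTrainer(
    model=model,              # reference model defaults to a frozen copy of `model`
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

DPO training in TRL looks almost identical, except the dataset pairs each prompt with a "chosen" and a "rejected" completion instead of labeling single completions.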

Want to prototype with prompts and supervised fine-tuning? Try Entry Point AI:
https://www.entrypointai.com/

How about connecting? I'm on LinkedIn:
https://www.linkedin.com/in/markhennings/
