
Creating, Curating, and Cleaning Data for LLMs

Hamel Husain 4,038 10 months ago

Good data is a key component of a strong LLM. This talk outlines approaches to getting the best data for training your LLMs, covering how to find existing datasets to build on, approaches to creating synthetic data, and practical techniques and tools for exploring, deduplicating, and filtering datasets to improve their quality. Slides, notes, and other resources: https://parlance-labs.com/education/fine_tuning/daniel.html

*0:00 Introduction* Daniel Van Strien and David Berenstein introduce themselves and give an overview of their talk. They discuss datasets in the context of Large Language Models (LLMs) and briefly outline the dataset features available on the Hugging Face Hub.

*2:31 Reusing Existing Datasets* Hugging Face hosts a wide range of datasets tailored to specific domains and tasks, though their relevance to your use case may vary. The Hub provides tools for searching, viewing, and exploring datasets.

*7:14 Creating Your Own Datasets* Datasets can be created by restructuring existing data, incorporating user feedback to capture preferences, drawing on internal data sources, or generating synthetic data. The discussion covers the preprocessing required before training LLMs.

*9:04 Dataset Preparation* Daniel explains the importance of formatting datasets to meet the specific requirements of LLM training, emphasizing scoping and planning based on user needs.

*11:09 Supervised Fine-Tuning Datasets* These datasets consist of question-answer pairs used to fine-tune models for specific tasks, making it easier to map high-level concepts to data.

*12:56 Direct Preference Optimization (DPO) Datasets* Each example pairs an input with a preferred and a rejected response, guiding the model toward desired outputs using ground-truth and suboptimal examples.
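The SFT and DPO formats described above can be sketched as plain Python records. The field names (`messages`, `prompt`, `chosen`, `rejected`) follow common conventions such as those used by the TRL library; they are assumptions here, not the speakers' exact schema:

```python
# Supervised fine-tuning (SFT): instruction/response pairs, often
# stored as chat-style message lists.
sft_example = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}

# Direct Preference Optimization (DPO): one prompt paired with a
# preferred ("chosen") and a suboptimal ("rejected") response.
dpo_example = {
    "prompt": "Summarize: The cat sat on the mat.",
    "chosen": "A cat sat on a mat.",
    "rejected": "Cats are popular pets worldwide.",
}

def validate_dpo(record: dict) -> bool:
    """A usable DPO record has all three fields and distinct responses."""
    required = {"prompt", "chosen", "rejected"}
    return required <= record.keys() and record["chosen"] != record["rejected"]

print(validate_dpo(dpo_example))  # True
```

A validation pass like `validate_dpo` is worth running before training, since preference pairs where chosen and rejected are identical carry no learning signal.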
*14:43 Kahneman-Tversky Optimization (KTO) Datasets* These datasets use binary feedback (thumbs up or thumbs down) on model responses, which is easy to collect from user interactions in existing systems.

*15:47 SPIN and ORPO as Alternatives to DPO* SPIN generates synthetic data from a minimal initial dataset to reduce data requirements, while ORPO streamlines training by skipping the separate fine-tuning step, using a format similar to DPO.

*17:56 Synthetic Data* David discusses how LLMs generate synthetic datasets, improving model quality and complexity through prompts, completions, and AI-generated feedback for refining preferences.

*20:25 Issues with Synthetic Data* David highlights concerns such as hallucinations, toxicity, and stereotypes in models trained on synthetic data, potentially stemming from biases in the models that generated it.

*21:18 Instruction-Based Dataset Evaluation* Model completions are evaluated by GPT-4 against criteria such as truthfulness and helpfulness, simplified to a single overall rating to reduce cost. Human review reveals coding errors, underscoring the need for validation.

*24:20 Considerations in Synthetic Data Creation* Scaling efficiently requires avoiding vendor lock-in, ensuring fault tolerance, and generating structured output formats such as JSON, which highlights the complexity of the process.

*25:17 Outlines Package* Outlines produces structured text generation with JSON output, optimizing token sampling for efficiency and accuracy to reduce inference time.

*26:10 DSPy Package* DSPy focuses on programming prompts for LLMs, optimizing prompts and model weights through multiple API calls to improve prediction accuracy.

*27:09 Distilabel Framework* Distilabel uses a directed-graph structure to generate synthetic data and AI feedback, enabling scalable, parallel execution for efficient data processing.
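As noted above, KTO needs only binary feedback per completion, so existing thumbs-up/thumbs-down logs can be converted directly. A minimal sketch, where the output field names (`prompt`, `completion`, `label`) follow TRL's KTO convention and the input log schema is a hypothetical example:

```python
# Hypothetical user-interaction logs with binary feedback.
raw_logs = [
    {"question": "Explain DNS briefly.",
     "answer": "DNS maps domain names to IP addresses.",
     "thumbs_up": True},
    {"question": "Explain DNS briefly.",
     "answer": "DNS is a type of firewall.",
     "thumbs_up": False},
]

def to_kto(log: dict) -> dict:
    """Convert one feedback log entry into a KTO-style record."""
    return {
        "prompt": log["question"],
        "completion": log["answer"],
        "label": log["thumbs_up"],  # True = desirable, False = undesirable
    }

kto_dataset = [to_kto(entry) for entry in raw_logs]
print([record["label"] for record in kto_dataset])  # [True, False]
```

Unlike DPO, no pairing of responses per prompt is required, which is why this format is easy to bootstrap from systems that already collect simple ratings.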
*28:19 Improving Data Quality* David discusses the iterative process of dataset improvement, emphasizing evaluation of diversity, quality, and quantity: better data means higher quality, not simply more data.

*29:57 Data Improvement Strategies* Deduplication and custom techniques such as hashing and rule-based filtering with regular expressions can improve data quality.

*31:53 Advanced Techniques for Data Cleaning* Zero-shot models can provide initial topic predictions, classifiers enable more precise filtering, and LLMs can make rationale-backed decisions, complemented by intuitive text-descriptive tools for straightforward data analysis.

*32:27 Tools for Annotators* David showcases annotation tools ranging from pre-made interfaces to custom Gradio setups and more robust tools such as Lilac and Argilla.

*41:41 Example Dataset Walkthrough* Daniel walks through example DPO and KTO datasets, detailing the approach taken during their creation.

*45:00 Case Study: LLM Summarizer* Daniel discusses the pipeline for a summarizer he is developing, including preparation of the preference-data pipeline.

*50:48 Data Preparation Repository* Daniel shares a repository of notebooks covering the topics discussed in the talk.

*51:42 Resources and Conclusion*
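The two cheap cleaning techniques mentioned above, exact deduplication via hashing and rule-based filtering via regex, can be sketched in a few lines of standard-library Python. The specific filter rule (dropping rows that contain URLs) is an illustrative assumption, not a rule from the talk:

```python
import hashlib
import re

rows = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # exact duplicate
    "Visit http://example.com for more!",
    "A clean, unique training example.",
]

def dedupe(texts):
    """Drop exact duplicates (case/whitespace-insensitive) by hashing
    each normalized row and keeping the first occurrence."""
    seen, unique = set(), []
    for text in texts:
        digest = hashlib.md5(text.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

URL_RE = re.compile(r"https?://\S+")

def drop_urls(texts):
    """Rule-based filter: remove rows containing URLs."""
    return [t for t in texts if not URL_RE.search(t)]

cleaned = drop_urls(dedupe(rows))
print(len(cleaned))  # 2
```

Hashing catches only exact (or normalized-exact) duplicates; near-duplicate detection needs fuzzier techniques such as MinHash, and the heavier options mentioned in the talk (zero-shot models, classifiers, LLM judges) sit further along the same cost/precision spectrum.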
