https://latent.space/2024-syndata-smolmodels
Loubna Ben Allal, who works on synthetic data and Smol Language Models at Hugging Face, joined us to recap a year's worth of work in the field.
Timestamps
00:00 Introduction and Overview
00:18 Synthetic Data in 2024
01:09 Synthetic Data in Pre-Training
02:57 Model Collapse Concerns
04:11 Synthetic Data Quality and Benchmarks
08:51 Rephrasing and Textbook Generation
11:17 Synthetic Data for Filtering and Classification
13:28 Post-Training with Synthetic Data
16:17 Advancements in Small Models
18:17 On-Device and Efficient Models
25:14 Future Trends and Conclusion
Synthetic Data
We called out the Synthetic Data debate at last year’s NeurIPS, and it was no surprise that 2024 was dominated by the rise of synthetic data everywhere:
Apple’s Rephrasing the Web, Microsoft’s Phi-2 through Phi-4 and Orca/AgentInstruct, Tencent’s billion-persona dataset, DCLM, HuggingFace’s FineWeb-Edu, and Loubna’s own Cosmopedia extended the ideas of synthetic textbook and agent generation to improve the quality of raw web-scrape datasets
This year we also talked to the IDEFICS/OBELICS team at HuggingFace, who released WebSight, the first work on synthetic image-to-code data.
We called Llama 3.1 the Synthetic Data Model for its extensive use (and documentation!) of synthetic data in its pipeline, as well as its permissive license.
Nemotron-CC and Nemotron-4-340B also made a big splash this year for using only ~20k human-annotated examples to synthesize over 98% of the data used for supervised and preference fine-tuning.
Cohere introduced Multilingual Arbitrage: Optimizing Data Pools to Accelerate Multilingual Progress, observing win-rate improvements of up to 56.5% when sampling from multiple teacher models versus the single best teacher model
In post-training, AI2’s Tülu 3 (discussed by Luca in our Open Models talk) and Loubna’s SmolTalk were also notable open releases this year.
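The common thread in work like FineWeb-Edu is model-based quality filtering: an LLM (or a small classifier distilled from LLM annotations) scores each web document for educational value, and low-scoring documents are dropped before pre-training. A minimal sketch of that filtering step, using toy documents and made-up scores rather than any real model or HuggingFace's actual pipeline:

```python
# Sketch of FineWeb-Edu-style filtering: each document carries an
# educational-quality score (e.g. 0-5, assigned by an LLM annotator or a
# distilled classifier); we keep only documents at or above a threshold.
# Scores below are toy values for illustration, not real model outputs.

def filter_by_edu_score(docs, scores, threshold=3):
    """Keep documents whose educational-quality score meets the threshold."""
    return [doc for doc, score in zip(docs, scores) if score >= threshold]

docs = [
    "a textbook chapter on calculus",
    "spam product listing",
    "an introduction to Newtonian physics",
]
scores = [5, 0, 4]  # hypothetical LLM-assigned quality scores

kept = filter_by_edu_score(docs, scores)
# kept contains only the two educational documents
```

In the real pipelines, the expensive LLM annotation is done once on a sample, then a cheap classifier trained on those labels is run over the full corpus.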
This comes in the face of considerable scrutiny and criticism, with Scale AI among the leading voices: the Nature paper "AI models collapse when trained on recursively generated data" brought mainstream attention to the potential downsides of poor-quality synthetic data:
Some of the concerns we highlighted last year about low-background tokens are coming to pass: ChatGPT-contaminated data is spiking by every available metric:
But perhaps, if Sakana’s AI Scientist pans out this year, we will have mostly-AI researchers publishing AI research anyway; do we really care, so long as the ideas can be verified to be correct?
Smol Models
Meta surprised many folks this year by not just aggressively updating Llama 3 and adding multimodality, but also adding a new series of “small” 1B and 3B on-device models, even working on quantized-numerics collaborations with Qualcomm, MediaTek, and Arm. It is nearly unbelievable that a 1B model today can qualitatively match a 13B model of last year:
and the minimum model size needed to hit a given MMLU bar has come down roughly 10x in the last year. We have been tracking this trend, proxied by LMSYS Elo and inference price:
The key reads this year are:
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
Apple Intelligence Foundation Language Models
Hymba: A Hybrid-head Architecture for Small Language Models
Loubna’s SmolLM and SmolLM2: a family of state-of-the-art small models with 135M, 360M, and 1.7B parameters, sitting on the Pareto efficiency frontier.
and Moondream, which we already covered in the 2024 in Vision talk.