Preparing Fineweb - A Finely Cleaned Common Crawl Dataset

Trelis Research 3,224 views 10 months ago

➡️ Newsletter: https://blog.Trelis.com
➡️ Resources/Support/Discord: https://Trelis.com/About

VIDEO RESOURCES:
- Slides: https://docs.google.com/presentation/d/15KQFPUFPKi3G3VpFvXOhO6ceX4nrUdg7zn-L-cL96gg/edit?usp=sharing
- Dataset: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
- fineweb.py: https://github.com/huggingface/datatrove/blob/main/examples/fineweb.py
- Fineweb blog: https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1

TIMESTAMPS:
0:00 Common Crawl Data Processing Pipeline
0:42 Video Overview
1:50 Common Crawl Raw Dataset
3:19 Common Crawl improves over time?
5:35 Dataset Comparisons: C4, RefinedWeb, Fineweb, Llama 3, Phi-3
9:38 Data Processing Pipeline and Datatrove
13:39 Quality filters: Gopher, C4, Fineweb
20:39 Deduplication strategies
25:43 Fineweb Edu: LLM-assisted dataset filtering
28:43 Training a classifier for dataset filtering
33:15 My recommendation: Fineweb Edu Latest Crawl
35:22 Why is Llama 3 better than Llama 2?
