Data Engineering for Large Models: Architecture, Algorithms & Projects
Introduction
"Data is the new oil, but only if you know how to refine it."
In the era of large models, data quality sets the upper bound on model performance. Yet systematic resources on LLM data engineering remain scarce, and most teams are still learning by trial and error.
This book is designed to fill that gap. It systematically covers the complete technical stack, from pre-training data cleaning to multimodal alignment and from retrieval-augmented generation (RAG) to synthetic data generation, including:
- Pre-training Data Engineering: Extracting high-quality corpora from massive noisy data sources like Common Crawl
- Multimodal Data Processing: Collection, cleaning, and alignment of image-text pairs, video, and audio data
- Alignment Data Construction: Automated generation of SFT instruction data, RLHF preference data, and CoT reasoning data
- RAG Data Pipeline: Enterprise-grade document parsing, semantic chunking, and multimodal retrieval
Beyond in-depth theoretical explanations, the book includes five end-to-end capstone projects, each with runnable code and a detailed architecture design for hands-on learning.
Read Online: https://datascale-ai.github.io/data_engineering_book/en/

Links
https://github.com/datascale-ai/data_engineering_book/blob/main/README_en.md