Hi HN! I'm currently a Master's student at USTC (University of Science and Technology of China). I've been diving deep into Data Engineering, especially in the context of Large Language Models (LLMs).
The Problem: I found that learning resources for modern data engineering are often fragmented and scattered across hundreds of Medium articles or disjointed tutorials. It's hard to piece everything together into a coherent system.
The Solution: I decided to open-source my learning notes and build them into a structured book. My goal is to help developers fast-track their learning curve.
Key Features:
LLM-Centric: Focuses on data pipelines specifically designed for LLM training and RAG systems.
Scenario-Based: Instead of just listing tools, I compare different methods/architectures based on specific business scenarios (e.g., "When to use Vector DB vs. Keyword Search").
Hands-on Projects: Includes full code for real-world implementations, not just "Hello World" examples.
This is a work in progress, and I'm treating it as "Book-as-Code". I would love to hear your feedback on the roadmap or any "anti-patterns" I might have included!
Check it out:
Online: https://datascale-ai.github.io/data_engineering_book/
GitHub: https://github.com/datascale-ai/data_engineering_book
I'm not sure whether this is an artefact of translation, but things like this don't inspire confidence:
> The "Modern Data Stack" (MDS) is a hot concept in data engineering in recent years, referring to a cloud-native, modular, decoupled combination of data infrastructure
https://github.com/datascale-ai/data_engineering_book/blob/m...
Later parts are better and more to the point though: https://github.com/datascale-ai/data_engineering_book/blob/m...
Edit: perhaps I judged too early. The RAG section isn't bad either: https://github.com/datascale-ai/data_engineering_book/blob/m...
I'd have titled the submission 'Data Engineering for LLMs...' as it is focused on that.
> "Data is the new oil, but only if you know how to refine it."
Oil[0] is fairly useless without being refined as well. Perhaps: "Data is the new oil; you need to refine it"?
The 'Vector DB vs Keyword Search' section caught my eye. In your testing for RAG pipelines, where do you draw the line?
We've found keyword search (BM25) often beats semantic search for specific entity names/IDs, while vectors win on concepts. Do you cover hybrid search patterns/re-ranking in the book? That seems to be where most production systems end up.
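For concreteness, here is a minimal sketch of the kind of fusion I mean, using reciprocal rank fusion (RRF) over a BM25 ranking and a vector ranking; the doc IDs are made up, and k=60 is just the constant suggested in the original RRF paper:

    # Minimal RRF sketch: fuse ranked lists of doc IDs without score normalization.
    from collections import defaultdict

    def rrf_fuse(rankings, k=60):
        # Each doc scores sum(1 / (k + rank)) across every list it appears in.
        scores = defaultdict(float)
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # BM25 nails the exact entity ID; the vector ranking gets the concept.
    bm25_hits = ["doc-42", "doc-7", "doc-13"]
    vector_hits = ["doc-7", "doc-99", "doc-42"]
    print(rrf_fuse([bm25_hits, vector_hits]))  # docs in both lists rise to the top

The appeal over weighted score blending is that RRF only needs ranks, so you never have to put BM25 scores and cosine similarities on the same scale.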
This is great and I bookmarked it so I can read it later. I'm just curious though: was the README written by ChatGPT? I can't tell if I'm paranoid thinking everything is written by ChatGPT.
English version: https://github.com/datascale-ai/data_engineering_book/blob/m...
The figures in the different chapters are in English (that's not the case for the image in README_en.md).
Parquet alone is not enough for modern data engineering. Delta and Iceberg should be on the list.
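To make the gap concrete, a minimal sketch using the delta-rs Python bindings (pip install deltalake); the path and schema are hypothetical. A directory of raw Parquet files gives you neither the atomic append nor the time travel shown here:

    # Hypothetical events table: what a table format adds on top of Parquet files.
    import pyarrow as pa
    from deltalake import DeltaTable, write_deltalake

    batch = pa.table({"user_id": [1, 2], "event": ["click", "view"]})
    write_deltalake("/tmp/events", batch)                   # creates version 0
    write_deltalake("/tmp/events", batch, mode="append")    # atomic commit -> version 1

    dt = DeltaTable("/tmp/events")
    print(dt.version())                                     # 1
    old = DeltaTable("/tmp/events", version=0).to_pandas()  # time travel to version 0

Iceberg provides the same kind of guarantees through its own metadata layer; either way, the point is that the table format, not the file format, carries them.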
If you are interested in (2026-era) internet-scale data engineering challenges (e.g. processing 10-100s of petabytes) and in pre-training/mid-training/post-training scale challenges, please send me an email at [email protected]!
Thank you
How is it possible that a Chinese publication gets to the top of HN?
Thank you so much for this book! I'm finding the translation to be very high quality.
I am a complete novice in training LLMs, and have been trying to train a novel architecture for Python code generation, using Apple Silicon.
I've been a bit frustrated, to be honest, that the data tools don't seem to have any focus on code; their modalities are generic text and images. And for synthetic data generation I would love to use EBNF-constrained outputs, but SGLang is not available on macOS. So I feel a bit stuck: downloading a large corpus of Python code, running into APFS issues, sharding, custom classifying, custom cleaning, custom mixing, etc. Maybe I've missed a tool, but I'm surprised there aren't pre-tagged, pre-categorized, pre-filtered datasets for code where I can just tune the curriculum/filters to feed into training.
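To illustrate the "tune the filters" part, here is a rough sketch of the kind of custom filtering step I mean, assuming the Hugging Face datasets library and the Python subset of The Stack (the dataset is gated, so it needs a logged-in Hub account); the heuristics and thresholds are placeholders, not recommendations:

    # Stream a code corpus and keep files via crude, tunable heuristics.
    from datasets import load_dataset

    ds = load_dataset("bigcode/the-stack-dedup", data_dir="data/python",
                      split="train", streaming=True)

    def keep(example):
        src = example["content"]
        lines = src.splitlines()
        if not lines or len(lines) > 2000:            # drop empty or huge files
            return False
        avg_len = sum(map(len, lines)) / len(lines)
        return avg_len < 120 and "def " in src        # crude quality/relevance proxy

    for example in ds.filter(keep).take(3):
        print(example["content"][:80])

If the corpus shipped pre-computed tags (license, quality scores, detected constructs), this keep() function would be the only custom piece left to write.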