If you really want to see fully open training pipelines for modern LLMs, Olmo and, to a lesser extent, Nemotron are what you should look at.
Check out OpenThoughts. It has a widely used dataset, a model that beats DeepSeek's smaller reasoning models, and a paper that goes into detail on the data curation methodology.
Too old now
What is the estimated cost these days to train something like this to conclusion?
"This will likely involve curating new, large-scale datasets for math, reasoning, and code." ... everybody likes to hand-wave on this.
Last update over a year ago, so I hope (2025) gets added to the title:
> [2025/05/26] (Step 1 completed!) We release Mixture-of-Thoughts--a curated reasoning dataset of 350k verified traces distilled from R1. The dataset spans tasks in mathematics, coding, and science, and is designed to teach language models to reason step-by-step. We also provide a recipe to train OpenR1-Distill-7B, which replicates the reasoning capabilities of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B and marks the completion of step 1 in the Open R1 project.
Doesn't look like they managed to actually reproduce R1; they stopped at Step 1 of their 3-step plan.