Hacker News

kristjansson · today at 4:10 AM · 0 replies

All evidence is that the final training runs use thousands to low tens of thousands of GPUs, and that a single instance of the resulting model runs (or could run) well within a single rack (i.e., an NVL72).
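A rough sanity check on the "fits in a rack" claim: an NVL72 domain links 72 Blackwell GPUs, each with on the order of 186 GB of HBM. The model size and precision below are illustrative assumptions (a hypothetical ~2T-parameter model served in FP8), not figures for any specific model.

```python
# Back-of-the-envelope: do the weights of a frontier-scale model fit in
# the aggregate HBM of one NVL72 rack? All model numbers are assumptions.

GPUS_PER_RACK = 72        # GB200 NVL72: 72 GPUs in one NVLink domain
HBM_PER_GPU_GB = 186      # approximate HBM per Blackwell GPU

rack_hbm_gb = GPUS_PER_RACK * HBM_PER_GPU_GB   # aggregate rack memory

params = 2.0e12           # hypothetical ~2T-parameter model
bytes_per_param = 1       # FP8 weights
weights_gb = params * bytes_per_param / 1e9

print(f"rack HBM: {rack_hbm_gb:,} GB")        # → rack HBM: 13,392 GB
print(f"weights:  {weights_gb:,.0f} GB")      # → weights:  2,000 GB
print(f"fits:     {weights_gb < rack_hbm_gb}")
```

Even with generous headroom for KV cache and activations, weights at this scale occupy well under the rack's aggregate HBM, which is consistent with the claim above.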

The massive scale is all massively parallel: test-time compute for users, test-time compute for RL rollouts (and, increasingly, the environments for those rollouts), other synthetic data generation, research experiments, …