All the evidence suggests that final training runs use thousands to low tens of thousands of GPUs, and that a single instance of the resulting model runs (or could run) well within a single rack (i.e., an NVL72).
The massive scale elsewhere is all massively parallel workloads: test-time compute for users, test-time compute for RL rollouts (and, probably increasingly, environments for those rollouts), other synthetic data generation, research experiments, … A minimal sketch of this structure follows.
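To make the structural point concrete, here is a minimal sketch (all names, numbers, and the dummy workload are illustrative, not from any real training stack) of why these workloads scale out so easily: each rollout is an independent job needing only one model instance — one rack's worth of inference — so the fleet grows by adding workers, not by tighter coupling between them.

```python
# Hypothetical sketch: RL rollouts as embarrassingly parallel jobs.
# Each rollout shares no state with the others, so throughput scales
# roughly linearly with worker count; the same pattern holds whether
# "workers" are processes on one box or whole racks in a fleet.
from concurrent.futures import ProcessPoolExecutor
import random


def run_rollout(seed: int) -> float:
    """Stand-in for one independent RL episode served by a single
    model instance (a real version would loop model.generate() /
    env.step() here; this just returns a dummy episode score)."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100))


if __name__ == "__main__":
    # Fan 1,000 independent rollouts across 8 workers; no worker ever
    # needs to talk to another, which is the whole point.
    with ProcessPoolExecutor(max_workers=8) as pool:
        returns = list(pool.map(run_rollout, range(1000)))
    print(f"collected {len(returns)} independent rollouts")
```

The same shape applies to user-facing inference, synthetic data generation, and research experiments: many independent jobs, each comfortably within one rack.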