All the evidence suggests that final training runs use thousands to low tens of thousands of GPUs, and that a single instance of the resulting model runs (or could run) well within a single rack (i.e., an NVL72).
The massive scale elsewhere is all massively parallel workloads: test-time compute for users, test-time compute for RL rollouts (and, probably increasingly, environments for those rollouts), other synthetic data generation, research experiments, … A minimal sketch of this structure follows.
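To make the structural point concrete, here is a minimal sketch (all names, numbers, and the dummy workload are illustrative, not from any real training stack) of why these workloads scale out so easily: each rollout is an independent job needing only one model instance — one rack's worth of inference — so the fleet grows by adding workers, not by tighter coupling between them.

```python
# Hypothetical sketch: RL rollouts as embarrassingly parallel jobs.
# Each rollout shares no state with the others, so throughput scales
# roughly linearly with worker count; the same pattern holds whether
# "workers" are processes on one box or whole racks in a fleet.
from concurrent.futures import ProcessPoolExecutor
import random


def run_rollout(seed: int) -> float:
    """Stand-in for one independent RL episode served by a single
    model instance (a real version would loop model.generate() /
    env.step() here; this just returns a dummy episode score)."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100))


if __name__ == "__main__":
    # Fan 1,000 independent rollouts across 8 workers; no worker ever
    # needs to talk to another, which is the whole point.
    with ProcessPoolExecutor(max_workers=8) as pool:
        returns = list(pool.map(run_rollout, range(1000)))
    print(f"collected {len(returns)} independent rollouts")
```

The same shape applies to user-facing inference, synthetic data generation, and research experiments: many independent jobs, each comfortably within one rack.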