They tried and failed. xAI made a mistake building Colossus 1 and ended up with a heterogeneous cluster of H100/H200/GB200 GPUs. This is a nightmare to train huge models on because each card has different specs, features, and hardware requirements. During gradient synchronization, a heterogeneous cluster bottlenecks on the slowest GPU (the H100), so the faster GPUs end up idling. They also probably ran into unexpected compatibility issues, which are difficult to resolve.
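To make the bottleneck concrete, here is a rough sketch of a synchronous step where every GPU gets the same amount of work. The throughput numbers are made-up relative values for illustration, not real benchmarks, and `sync_step_time` is a hypothetical helper, not anything from a real training framework.

```python
def sync_step_time(per_gpu_work, throughput):
    """One synchronous step: nobody proceeds until the slowest GPU is done."""
    # Compute time per GPU type for an identical share of work.
    times = {gpu: per_gpu_work / tput for gpu, tput in throughput.items()}
    # Gradient sync cannot finish before the slowest participant arrives.
    step = max(times.values())
    # Fraction of the step each GPU type spends waiting.
    idle = {gpu: 1 - t / step for gpu, t in times.items()}
    return step, idle


# Hypothetical relative throughputs (assumed, not measured).
throughput = {"H100": 1.0, "H200": 1.25, "GB200": 2.0}
step, idle = sync_step_time(per_gpu_work=100.0, throughput=throughput)

for gpu, frac in idle.items():
    print(f"{gpu}: idle {frac:.0%} of each step")
```

Under these assumed numbers the fastest cards sit idle for half of every step, which is the waste being described.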
It makes more sense to use this cluster for inference, since they can segment the cluster by GPU type and avoid mixing GPUs. xAI doesn't have enough inference customers, so it makes sense to monetize the cluster by renting it to companies that need inference compute, such as Anthropic or Cursor.
Apparently xAI will try building SOTA models on Colossus 2, which will be built on Blackwell GPUs only.
How can something so obvious be overlooked by the team building the data centre? Can't the sharding be uneven so that weaker GPUs still finish fast by taking on a smaller workload?
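For what it's worth, the uneven-sharding idea in the question can be sketched for plain data parallelism as below. This is only an illustration under assumed relative throughputs (invented numbers, not real specs), and `proportional_shards` is a hypothetical helper. It ignores memory limits, interconnect differences, and tensor/pipeline parallelism, where balancing is likely much harder, and with unequal shards the gradients would also need to be weighted by sample count when averaged.

```python
def proportional_shards(global_batch, throughput):
    """Split a global batch so each GPU's share matches its relative speed."""
    total = sum(throughput.values())
    # Floor each share, then hand any leftover samples to the fastest GPUs.
    shards = {gpu: int(global_batch * t / total) for gpu, t in throughput.items()}
    leftover = global_batch - sum(shards.values())
    fastest_first = sorted(throughput, key=throughput.get, reverse=True)
    for gpu in fastest_first[:leftover]:
        shards[gpu] += 1
    return shards


# Hypothetical relative throughputs (assumed, not measured).
throughput = {"H100": 1.0, "H200": 1.25, "GB200": 2.0}
shards = proportional_shards(425, throughput)

# Per-GPU compute time is now roughly equal, so nobody waits at sync.
times = {gpu: shards[gpu] / throughput[gpu] for gpu in shards}
print(shards)
print(times)
```

With these made-up numbers every GPU finishes at the same time, so the arithmetic works on paper; whether it works at cluster scale is a separate question.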