This is somewhat out of date (Dec 2024), but gives you some idea of how far behind AMD was then: https://newsletter.semianalysis.com/p/mi300x-vs-h100-vs-h200...
Pull quotes:
> AMD’s software experience is riddled with bugs, rendering out of the box training with AMD impossible. We were hopeful that AMD could emerge as a strong competitor to NVIDIA in training workloads, but, as of today, this is unfortunately not the case. The CUDA moat has yet to be crossed by AMD due to AMD’s weaker-than-expected software Quality Assurance (QA) culture and its challenging out of the box experience.
[snip]
> The only reason we have been able to get AMD performance within 75% of H100/H200 performance is because we have been supported by multiple teams at AMD in fixing numerous AMD software bugs. To get AMD to a usable state with somewhat reasonable performance, a giant ~60 command Dockerfile that builds dependencies from source, hand crafted by an AMD principal engineer, was specifically provided for us
[snip]
> AMD hipBLASLt/rocBLAS’s heuristic model picks the wrong algorithm for most shapes out of the box, which is why so much time-consuming tuning is required by the end user.
etc etc. The whole thing is worth reading.
I'm sure it has improved since then (and will continue to). I hear good things about the Lemonade team (although I think that is mostly inference?)
But the NVIDIA stack has improved too.
Anecdotal, but over several years with an AMD GPU in my desktop I've tried multiple times to do real AI work with the AMD stack, and I've given up every time.
Yet another reason to doubt claims that “software is solved”.
Anthropic did retire an interview take-home assignment involving optimising inference on exotic hardware because Claude could one-shot a solution, but that was clearly a whiteboard hypothetical rather than a real system with warts, issues, and nuance.
That’s insane. There should be a big team of people at AMD whose whole job is just to dogfood their stuff for training like this. Speaking of which, Amazon is in the same boat, I’m constantly surprised that Amazon is not treating improving Inferentia/Trainium software as an uber-priority. (I work at Amazon)