
GeekyBear today at 5:27 AM

DeepSeek's hand-written PTX code has previously outperformed equivalent CUDA code running on Nvidia H800 GPUs.

> DeepSeek made quite a splash in the AI industry by training its Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster featuring 2,048 Nvidia H800 GPUs in about two months, showing 10X higher efficiency than AI industry leaders like Meta. The breakthrough was achieved by implementing tons of fine-grained optimizations and usage of Nvidia's assembly-like PTX (Parallel Thread Execution) programming instead of Nvidia's CUDA for some functions…

https://www.tomshardware.com/tech-industry/artificial-intell...

Custom code targeting one specific hardware implementation can improve performance quite a bit.
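For a sense of what "dropping down to PTX" means in practice, here is a minimal, purely illustrative sketch (this is not DeepSeek's actual code; the kernel name `scale_add` and the choice of instructions are my own). A CUDA C++ kernel can embed hand-written PTX via inline `asm`, which lets you pick specific instruction variants, such as cache-control hints or rounding modes, that plain CUDA C++ does not expose directly:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative only: a CUDA C++ kernel with two hand-written PTX
// instructions embedded via inline asm. Computes out[i] = in[i]*factor + in[i].
__global__ void scale_add(const float* __restrict__ in, float* __restrict__ out,
                          float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x;
        // ld.global.nc.f32: load through the non-coherent (read-only) data
        // cache, a cache-control hint chosen explicitly at the PTX level.
        asm volatile("ld.global.nc.f32 %0, [%1];" : "=f"(x) : "l"(in + i));
        float y;
        // fma.rn.f32: fused multiply-add with round-to-nearest,
        // y = x * factor + x, selected explicitly rather than left
        // to the compiler.
        asm volatile("fma.rn.f32 %0, %1, %2, %3;"
                     : "=f"(y) : "f"(x), "f"(factor), "f"(x));
        out[i] = y;
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    scale_add<<<(n + 255) / 256, 256>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);  // expect 3.0 (1*2 + 1)
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The trade-off is the usual one: instruction-level control like this is tied to a particular architecture's behavior, so the gains on an H800 may not carry over to other GPUs, and the code is far harder to maintain than portable CUDA.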