BillStrong • today at 3:53 PM

As long as you don't keep calling out to the CPU, that is.

Tool calling, searches, cache movement if used, and even debug steps all stall the GPU waiting for the CPU.

There was a test that turned one of the sub-1B Qwen3+ models into a single kernel that runs as one GPU pass without stalling on the CPU. It saw quite a bit of perf lift over vLLM, I believe, which shows this is still an issue.
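
To put rough arithmetic on why a small model would benefit most, here is a toy cost model. It is only a sketch, not a benchmark: the function names and every number in it are made up for illustration, and it assumes each CPU round trip costs a fixed amount of time during which the GPU sits idle.

```python
# Toy cost model, not a measurement: all inputs are hypothetical.

def decode_time_us(tokens: int, gpu_step_us: float,
                   cpu_roundtrip_us: float, roundtrips_per_token: int) -> float:
    """Total time when every token pays for GPU work plus CPU round trips."""
    per_token = gpu_step_us + roundtrips_per_token * cpu_roundtrip_us
    return tokens * per_token


def speedup(tokens: int, gpu_step_us: float,
            cpu_roundtrip_us: float, roundtrips_per_token: int) -> float:
    """Ratio of the stalled loop to a single-pass kernel under this model."""
    stalled = decode_time_us(tokens, gpu_step_us,
                             cpu_roundtrip_us, roundtrips_per_token)
    # Assumption: the single-pass kernel pays one round trip in total (its launch).
    fused = decode_time_us(tokens, gpu_step_us, 0.0, 0) + cpu_roundtrip_us
    return stalled / fused


if __name__ == "__main__":
    # Same hypothetical 50 us round trip, two hypothetical per-token GPU costs.
    for gpu_step_us in (200.0, 20.0):
        print(gpu_step_us, round(speedup(1000, gpu_step_us, 50.0, 1), 2))
```

Under these assumptions the fixed round-trip cost matters more as the per-token GPU work shrinks, which would fit with the lift showing up on a small model.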

It's been a month, so I don't remember more details than this.
