On a 5090, gemma4 26B runs at 350TPS with the command below [1] and gemma4 31B is around 150TPS with...

trouve_search • yesterday at 4:59 PM • 2 replies • view on HN

On a 5090, gemma4 26B runs at 350TPS with the command below [1] and gemma4 31B is around 150TPS with a similar command.

I'm really surprised how much slower a DGX spark is for the same price.

1. Here's my command.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \ vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \ --dtype auto \ --gpu-memory-utilization 0.95 \ --kv-cache-dtype fp8 \ --enable-chunked-prefill \ --enable-prefix-caching \ --trust-remote-code \ --enable-auto-tool-choice \ --tool-call-parser gemma4 \ --reasoning-parser gemma4 \ --max-num-batched 16000 \ --max-model-len 64000 \ --max-num-seqs 12 --speculative-config '{"model": "./gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 4}'

Replies

adam_arthur • yesterday at 5:10 PM

Yes, I'd recommend a 5090 over the DGX Spark if your goal is general automation.

You can run multiple instances of these models in parallel on the DGX Spark which somewhat mitigates the difference if your task is parallelizable.

But I'd take the simplicity of a single thread and higher throughput personally.

Overall of course still better to wait for next gen devices if you can.

diddid • yesterday at 10:43 PM

With the 5090 you need to buy the rest of the computer though, and the Dgx spark will run 1/4th as slow but use 1/5th the electricity. And the spark would be able to run things the 5090 just couldn’t, like the Qwen3.5 122b. Which is all just to say that for llm workflows there is no easy answer. And if you media generation it gets even more complicated.

alt Hacker News

Replies