I've seen the same, Sparks are great at non time-sensitive tasks. if you can set up a agentic loop that does not require human intervention, you can design around the memory bandwidth limitations
The other benefit is that speculative decoding literally trades compute to make up for low bandwidth, so MTP/EAGLE/DFlash are unreasonably effective on the GB10 IMO, as long as your use case fits it.
I’m getting 40tk/s decode with 1000+tk/prefill with a 198B-A11B model on mine
The other benefit is that speculative decoding literally trades compute to make up for low bandwidth, so MTP/EAGLE/DFlash are unreasonably effective on the GB10 IMO, as long as your use case fits it.
I’m getting 40tk/s decode with 1000+tk/prefill with a 198B-A11B model on mine