I love my Spark-alike, but these really aren't inference boxes IMO. They're experimentation boxes. A couple of 3080 20GBs for cheap from China, a 5090, or an RTX Pro 6000 if you can swing the horrible cost: those are better choices for inference.
That said, I'm still running Step 3.7 Flash at ~40 tk/s decode and 1000+ tk/s prefill on mine, and it's both very capable and fast enough.
I got Gemma 31b to run on this at ~22tk/s decode at FP8 using MTP
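For a rough sense of what "fast enough" means in wall-clock terms, here's a back-of-the-envelope sketch. It assumes latency is just prefill time plus decode time, which ignores model load, scheduling, and any slowdown at long context. The 8000-token prompt and 500-token reply are made-up sizes for illustration, and `estimate_latency_s` is a hypothetical helper, not part of any inference stack.

```python
def estimate_latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough end-to-end latency: time to ingest the prompt plus time to generate the reply."""
    prefill_s = prompt_tokens / prefill_tps  # prompt processing
    decode_s = output_tokens / decode_tps    # token-by-token generation
    return prefill_s + decode_s


# Assumed workload: 8000-token prompt, 500-token reply (illustrative only).
# Rates are the ones quoted above: ~1000 tk/s prefill, ~40 tk/s decode.
total_s = estimate_latency_s(8000, 500, prefill_tps=1000, decode_tps=40)
print(f"~{total_s:.1f} s total")  # 8.0 s prefill + 12.5 s decode

# Decode-only time for the same 500-token reply at ~22 tk/s.
print(f"~{500 / 22:.1f} s decode at 22 tk/s")
```

With those assumed sizes, decode dominates, so the decode rate matters more than prefill for chat-length prompts.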