Recent model, released a couple of weeks ago. "Mixture of Experts (MoE) architecture, it selectively activates only 11B of its 196B parameters per token". It beats Kimi K2.5 and GLM 4.7 on more benchmarks than it loses to them.
Edit: there are 4-bit quants that can be run on a 128GB machine like a GB10 [1], AI Max+ 395, or Mac Studio.
[1] https://forums.developer.nvidia.com/t/running-step-3-5-flash...
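Back-of-envelope in Python for why the ~4-bit files land just above 110 GB. The bits-per-weight figure is approximate and real GGUFs keep some tensors at higher precision, so treat this as a rough check rather than a spec:

    # Rough size estimate for a ~4.5 bits-per-weight quant of a 196B-parameter model.
    PARAMS = 196e9          # total parameters
    BITS_PER_WEIGHT = 4.5   # approximate average for Q4_0 / IQ4_NL-style quants

    size_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9
    print(f"~{size_gb:.0f} GB of weights")  # ~110 GB
    print(f"~{128 - size_gb:.0f} GB left for KV cache, runtime, and OS on a 128GB box")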
Q4_K_S @ 116 GB
IQ4_NL @ 112 GB
Q4_0 @ 113 GB
Which of these would be technically better?
[1] https://huggingface.co/bartowski/stepfun-ai_Step-3.5-Flash-G...
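For anyone wanting to try one locally, a minimal sketch with the llama-cpp-python bindings. The file name, context size, and offload settings below are placeholders for illustration, not values from the linked repo:

    from llama_cpp import Llama

    llm = Llama(
        model_path="Step-3.5-Flash-Q4_K_S-00001-of-00003.gguf",  # hypothetical shard name; point at the first split file
        n_ctx=8192,        # larger contexts eat into the ~12-16 GB of headroom left after the weights
        n_gpu_layers=-1,   # offload all layers; fine on unified-memory machines like the ones above
    )

    out = llm("Summarize mixture-of-experts routing in two sentences.", max_tokens=128)
    print(out["choices"][0]["text"])

All three quants sit around 4.5 bits per weight (hence the near-identical file sizes), so size alone won't settle which is "technically better".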
> Beats Kimi K2.5 and GLM 4.7 on more benchmarks than it loses to them.
Does this really mean anything? I, for example, tend to ignore benchmarks focused on agentic tasks because that is not my use case. Instruction following, long-context reasoning, and low hallucination rates carry more weight for me.