This is probably one of the most underrated LLMs releases in the past few months. In my local testing with a 4-bit quant (https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF/tree/mai...), it surpasses every other LLM I was able to run locally, including Minimax 2.5 and GLM-4.7, though I was only able to run GLM with a 2-bit quant. Some highlights:
- Very context efficient: SWA by default, on a 128G mac I can run the full 256k context or two 128k context streams. - Good speeds on macs. On my M1 Ultra I get 36 t/s tg and 300 t/s pp. Also, these speeds degrade very slowly as context increases: At 100k prefill, it has 20 t/s tg and 129 t/s pp. - Trained for agentic coding. I think it is trained to be compatible with claude code, but it works fine with other CLI harnesses except for Codex (due to the patch edit tool which can confuse it).
This is the first local LLM in the 200B parameter range that I find to be usable with a CLI harness. Been using it a lot with pi.dev and it has been the best experience I had with a local LLM doing agentic coding.
There are a few drawbacks though:
- It can generate some very long reasoning chains. - Current release has a bug where sometimes it goes into an infinite reasoning loop: https://github.com/ggml-org/llama.cpp/pull/19283#issuecommen...
Hopefully StepFun will do a new release which addresses these issues.
BTW StepFun seems to be the same company that released ACEStep (very good music generation model). At least StepFun is mentioned in ComfyUI docs https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1
Is getting something like M3 Ultra with 512GB ram and doing oss models going to be cheaper for the next year or two compared to paying for claude / codex?
Did anyone do this kind of math?
Curious on how (if?) changes to the inference engine can fix the issue with infinitely long reasoning loops.
It’s my layman understanding that would have to be fixed in the model weights itself?
Did you try an MLX version of this model? In theory it should run a bit faster. I'm hesitant to download multiple versions though.
Have you tried Qwen3 Coder Next? I've been testing it with OpenCode and it seems to work fairly well with the harness. It occasionally calls tools improperly but with Qwen's suggested temperature=1 it doesn't seem to get stuck. It also spends a reasonable amount of time trying to do work.
I had tried Nemotron 3 Nano with OpenCode and while it kinda worked its tool use was seriously lacking because it just leans on the shell tool for most things. For example, instead of using a tool to edit a file it would just use the shell tool and run sed on it.
That's the primary issue I've noticed with the agentic open weight models in my limited testing. They just seem hesitant to call tools that they don't recognize unless explicitly instructed to do so.