Hacker News

Step 3.5 Flash – Open-source foundation model, supports deep reasoning at speed

202 points by kristianp today at 2:32 AM | 86 comments

Comments

tarruda today at 12:45 PM

This is probably one of the most underrated LLM releases of the past few months. In my local testing with a 4-bit quant (https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF/tree/mai...), it surpasses every other LLM I've been able to run locally, including Minimax 2.5 and GLM-4.7 (though I could only run GLM with a 2-bit quant). Some highlights:

- Very context efficient: SWA by default; on a 128GB Mac I can run the full 256k context, or two 128k-context streams.
- Good speeds on Macs: on my M1 Ultra I get 36 t/s tg and 300 t/s pp, and these speeds degrade very slowly as context increases: at 100k prefill it still does 20 t/s tg and 129 t/s pp.
- Trained for agentic coding: I think it is trained to be compatible with Claude Code, but it works fine with other CLI harnesses except Codex (due to its patch edit tool, which can confuse it).

This is the first local LLM in the 200B parameter range that I find to be usable with a CLI harness. Been using it a lot with pi.dev and it has been the best experience I've had with a local LLM doing agentic coding.
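For anyone wanting to poke at it the same way, a minimal sketch: llama.cpp's llama-server exposes an OpenAI-compatible endpoint, so any standard client works against it (the base URL and model name below are assumptions, use whatever you started the server with):

    # Minimal sketch: querying a local llama-server that has the
    # Step-3.5-Flash GGUF loaded. Base URL and model name are assumptions;
    # local servers ignore the API key, but the client requires one.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    resp = client.chat.completions.create(
        model="step-3.5-flash",
        messages=[{"role": "user",
                   "content": "Write a function that parses RFC 3339 timestamps."}],
        max_tokens=512,
    )
    print(resp.choices[0].message.content)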

There are a few drawbacks though:

- It can generate some very long reasoning chains.
- The current release has a bug where it sometimes goes into an infinite reasoning loop: https://github.com/ggml-org/llama.cpp/pull/19283#issuecommen...

Hopefully StepFun will do a new release which addresses these issues.

BTW, StepFun seems to be the same company that released ACEStep (a very good music generation model); at least StepFun is mentioned in the ComfyUI docs: https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1

lm2s today at 4:25 PM

Loved reading the reasoning[0] for the recent "Walk or drive to the carwash" trick.

[0] https://gist.github.com/lm2s/c4e3260c3ca9052ec200b19af9cfd70...

Not sure if it's directly accessible, but here's the link: https://stepfun.ai/chats/213451451786883072

anentropic today at 10:11 AM

> 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability

I don't know anything about Terminal-Bench, but on the face of it, a 51% score on a test metric doesn't sound like it would guarantee 'unwavering stability' on sophisticated, long-horizon tasks.

danieltanfh95 today at 5:05 AM

Hallucinates like crazy; use with caution. I tested it with simple prompts like "Find me championship decks for X pokemon" and "How does Y deck work". Opus 4.6, DeepSeek, and Kimi all performed well, as expected.

kristianp today at 2:33 AM

Recent model, released a couple of weeks ago. "Mixture of Experts (MoE) architecture, it selectively activates only 11B of its 196B parameters per token". It beats Kimi K2.5 and GLM 4.7 on more benchmarks than it loses to them.

Edit: there are 4-bit quants that can run on a 128GB machine like a GB10 [1], AI Max+ 395, or Mac Studio.

[1] https://forums.developer.nvidia.com/t/running-step-3-5-flash...
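Rough arithmetic for why a 4-bit quant fits (my own back-of-the-envelope numbers, not from the linked thread, assuming ~4.5 effective bits per weight once quantization scales are counted; real GGUF sizes vary by quant type):

    # Back-of-the-envelope: weight memory for a Q4-class quant of a 196B model.
    # 4.5 bits/weight is an assumption; actual GGUF files vary by quant mix.
    total_params = 196e9
    bits_per_weight = 4.5
    weights_gb = total_params * bits_per_weight / 8 / 1e9
    print(f"~{weights_gb:.0f} GB of weights")  # ~110 GB, leaving headroom for KV cache on a 128 GB box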

culi today at 8:21 AM

It's nice to see more focus on efficiency. All the recent model releases have come with massive jumps on certain benchmarks, but when you dig into it, those jumps are almost always paired with a massive increase in token usage (ahem, Google Deep Think). For AI to be truly transformational, it needs to solve the electricity problem.

mohsen1 today at 9:28 AM

SWE-bench Verified is nice, but we need better SWE benchmarks. Making a fair benchmark is a lot of work, and running it continuously costs a lot of money.

Most of "live" benchmarks are not running enough with recent models to give you a good picture of which models win.

The idea of a live benchmark is great! There are thousands of GitHub issues that are resolved with a PR every day.
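A minimal sketch of how one might harvest candidate tasks (the query and parameters are my own illustration; GitHub's issue search does support the linked:pr qualifier):

    # Sketch: find recently closed issues that were resolved via a linked PR,
    # as raw material for a live SWE benchmark. Unauthenticated requests are
    # heavily rate-limited; pass an auth token for real use.
    import requests

    resp = requests.get(
        "https://api.github.com/search/issues",
        params={
            "q": "is:issue is:closed linked:pr language:python",
            "sort": "created",
            "order": "desc",
            "per_page": 10,
        },
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    for issue in resp.json()["items"]:
        print(issue["html_url"], "-", issue["title"])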

hedgehog today at 4:23 PM

In a quick test using a few of my standard prompts, a few observations: 1) the trace was very verbose and written in an odd style reminiscent of chat, or of those annoying one-sentence-per-paragraph LinkedIn posts; 2) the token output rate is very high on the hosted version; 3) conformance to instructions and output quality were better than in most of the leading models I've tested (e.g. Opus 4.5).

janalsncm today at 9:37 AM

Number of params isn’t really the relevant metric imo. Top models don’t support local inference. More relevant is tokens per dollar or per second.

tallesborges92 today at 12:21 PM

I’ve been using this model for a while, and it’s very fast. It spends some time thinking but makes fewer calls. For example, yesterday I asked the agent to find the Gemini quota limit for their API: it took 27 seconds and just 2 calls, while Opus 4.6 took 33 seconds and 5 calls, with less thinking.

wmf today at 3:35 AM

That reversed x-axis sure is confusing.

Mashimo today at 10:48 AM

Holy moly, I gave it a simple coding prompt and the amount of reasoning output could fill a small book.

> create a single html file with a voxel car that drives in a circle.

Compared to GLM 4.7 / 5 and Kimi 2.5, it took a while. The output itself was fast, but because it wrote so much, I had to wait longer. The output was also... more bare-bones compared to the others.

prmph today at 9:43 AM

Interesting.

Each time a Chinese model makes the news, I wonder: How come no major models are coming from Japan or Europe?

amelius today at 10:54 AM

Does it pass the carwash test?

SilverElfin today at 4:53 AM

So who exactly is StepFun? What is their business (how do they make money)? Each time I click “About Stepfun” somewhere on their website, it sends me to a generic landing page in a loop.

sinenomine today at 9:33 AM

Works impressively well with pi.dev minimal agent.

lostmsu today at 3:35 PM

Any pelicans from non-quantized variants?

agentifysh today at 6:37 AM

What country is behind this one?
