gandreani • yesterday at 6:10 PM • 25 replies

Easily the most interesting part of this announcement is buried in the second to last paragraph:

"We're also launching GPT‑5.6 Sol on Cerebras at up to 750 tokens per second in July, bringing frontier intelligence to customers at unprecedented speed. Access will initially be limited to select customers as we expand capacity."

750 tokens/s on a frontier model is going to be extremely interesting. I doubt this new version is anything more than a version bump in terms of capabilities, but if we can start getting these answers back faster, they end up being more useful.

Just off the top of my head, I can think of the tedious task of finding certain functionality within a codebase. I usually can't beat an AI agent harness at this task today. If the AI model is 3x faster I have less of a chance.

Replies

qznc • yesterday at 10:11 PM

https://mikeveerman.github.io/tokenspeed/?rate=750&mode=thin...

This is what 750tps looks like, I guess.

sberens • yesterday at 6:18 PM

For comparison, openrouter says opus 4.8 is ~55 tokens/s and fast mode is ~102.

750 tokens/s for their largest model is going to be nuts
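
A rough sketch of what those rates mean for wait time, assuming a hypothetical 2,000-token response and a steady streaming rate (time-to-first-token is ignored, and the 750 figure is an "up to" number). The function and dictionary names here are made up for illustration.

```python
# Rates quoted in this thread, in tokens per second.
# The 2,000-token response length is a hypothetical example, not a measurement.
RATES_TPS = {
    "opus 4.8": 55,
    "opus 4.8 fast mode": 102,
    "GPT-5.6 Sol on Cerebras (up to)": 750,
}


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady rate, ignoring time-to-first-token."""
    return num_tokens / tokens_per_second


def compare(num_tokens: int = 2000) -> dict[str, float]:
    """Seconds needed for each quoted rate, rounded to one decimal place."""
    return {
        name: round(generation_seconds(num_tokens, tps), 1)
        for name, tps in RATES_TPS.items()
    }


if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name}: {seconds} s")
```

Under those assumptions the same response takes a bit over half a minute at 55 tokens/s and under three seconds at 750.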

donquichotte • yesterday at 7:38 PM

> I can think of the tedious task of finding certain functionality within a codebase. I usually can't beat an AI agent harness at this task today.

Yup, I remember "racing" the AIs to figure things out in codebases just a year ago. Today, I have no chance. Whether it is due to degraded reasoning capabilities on my part or better models, I don't know.

eli • yesterday at 7:04 PM

I'm skeptical of what "up to" 750t/s really means in practice. Maybe if they make it extremely expensive so it frees up enough capacity?

GPT‑5.3‑Codex‑Spark currently runs on Cerebras chips and it's giving me around 150t/s. Still relatively very fast, but nowhere near the 1,000t/s they claimed at launch. (Also it's not a very good model.)

That said, I'm super bought in to faster models being better for most use cases than smarter models.

linzhangrun • today at 1:40 AM

I saw videos of coding with Mimo-V2.5-Pro UltraSpeed, which is advertised at 1,000 tokens/s and looks very impressive:

https://www.bilibili.com/video/BV1fME16uEW7

If the time-to-first-token latency also improves greatly, this could be very useful for end-to-end control, such as autonomous driving.

tontinton • yesterday at 6:19 PM

Yep, this is a glimpse into the future of 500+ t/s, which is in my opinion the next big thing that validates the Jevons paradox (the models are already smart enough).

bob1029 • yesterday at 8:43 PM

At a certain rate we will be able to move towards continuous / real-time inference systems. The discrete, turn-based solutions are quite confining in how they must be trained. Continuous and real-time would fundamentally alter the domain.

From an information theory perspective we are still in dial-up territory with regard to the actual information rate. 750 tokens per second would be a really bad dial-up connection. Imagine 10 million tokens per second.
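
A back-of-the-envelope sketch of that comparison. It assumes a hypothetical vocabulary of 200,000 tokens (real tokenizers differ) and treats every token as carrying its maximum possible log2(vocab) bits, so it gives an upper bound on the raw symbol rate, not a measure of actual information content. The names are made up for illustration.

```python
import math

# Assumption: a hypothetical vocabulary of 200,000 tokens; real tokenizers differ.
ASSUMED_VOCAB_SIZE = 200_000
# Nominal rate of a 56k dial-up modem, for reference.
DIALUP_BITS_PER_SECOND = 56_000


def raw_bits_per_second(
    tokens_per_second: float, vocab_size: int = ASSUMED_VOCAB_SIZE
) -> float:
    """Upper bound on the symbol rate: a token carries at most log2(vocab_size) bits."""
    return tokens_per_second * math.log2(vocab_size)


if __name__ == "__main__":
    for tps in (750, 10_000_000):
        kbit = raw_bits_per_second(tps) / 1000
        print(f"{tps:>10,} tokens/s is at most about {kbit:,.1f} kbit/s")
```

Under these assumptions 750 tokens/s comes to roughly 13 kbit/s at most, which is below the nominal rate of a 56k modem.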

motoboi • yesterday at 7:41 PM

Bear in mind that "GPT‑5.6 Sol on Cerebras at up to 750 tokens per second" does not necessarily mean the same model (in terms of inference results). It could mean anything: a heavily quantized model, a different level of model activation per inference, etc.

Of course we can trust that they wouldn't give the same name to things with different levels of intelligence, right? Right?

_fat_santa • yesterday at 11:30 PM

I still use GPT-5.3-codex-spark, which also runs on the Cerebras chips. Spark can run at >1000 tok/s, but it's highly limited in its context window size, so it's not suitable for many workflows.

Granted this will be a bit slower (relatively speaking) but it will still be awesome.

js2 • yesterday at 11:51 PM

> second to last

There's a word for this that you should never pass up an opportunity to use: penultimate. (You should also never pass up the opportunity to use "defenestrate," but it sadly does not apply here.)

easygenes • today at 6:13 AM

This is a strange one. We know the hardware capabilities of Cerebras force them to do aggressive REAP pruning to serve Kimi K2.6. Meaning that about 750B parameters is the upper limit of what they can serve economically. Not sure if this means Sol is smaller than anyone thinks or that they're just going to charge so much that a very inefficient serving regime is feasible.

trollbridge • today at 12:51 PM

This is something Xiaomi already did with MiMo-2.5-Pro a month ago, and at a higher speed (1,000 t/s).

750 tps at GPT-5.5-Pro prices would be ruinous!

qnleigh • today at 7:14 AM

Last I heard, Cerebras chips were entire wafers and would be extremely expensive. How could OpenAI possibly have enough of these to serve a popular model at scale?

Cryptosale75 • today at 4:41 AM

Cerebras is Milli Vanilli. They spent 10 years burning cash on a failed idea (which is frankly insane, since they should have figured out the limitations of their stack in like... a weekend) and struck accidental gold with their 'Giant ass wafer'.

The company is valued like they broke open the grail, when in reality it's more like they bought a Cybertruck, got it stuck in the mud, and realized "You know what this thing does better than all other cars... shovel mud"

I'm shorting Cerebras with margin to virtually zero.

swalsh • yesterday at 8:17 PM

This would be amazing for some of our "real-time" workflows that need to fall back to AI for one reason or another. What used to happen is that a rules-based system did the majority of the work, and the occasional corner case would fall back to humans. Then we moved AI in: still not real time, but much faster. Cerebras could make that even faster.

helloplanets • yesterday at 6:18 PM

OpenAI also announced two days ago that they're starting to make Cerebras-style chips themselves [0]. It will be interesting to see how fast SotA model inference will be by the end of the year.

[0]: https://openai.com/index/openai-broadcom-jalapeno-inference-...

yiyingzhang • today at 4:05 AM

It all depends on the context window size. A small context size with fast performance won't be very useful today, as most workloads (like requests behind codex) usually have very long context.

jeswin • today at 2:42 AM

At thousands of tokens per second, LLMs (harnesses) can start to do a broader tree search of possibilities even in inefficient token space. This unlocks capabilities outside programming.

Avery29 • today at 6:41 AM

The speed sounds great; faster models make that gap much more visible.

nop17 • today at 2:18 AM

A 3x faster burn rate on top of 3x more expensive tokens: generate more tokens, pay more fees.

kingreflex • today at 10:04 AM

this means they also earn at a faster rate in some setups :)

lostmsu • yesterday at 7:54 PM

Does the Cerebras variant offer input caching and corresponding discounts? Last I checked Cerebras would not cache or would cache but not give discounts for the cached input, making it impractical for agentic use and multiturn conversations.

cruffle_duffle • yesterday at 6:45 PM

"we can start getting these answers back faster, they end up being more useful."

Dude, 10x token speed is going to be absolutely nuts. Half the "parallel subagent workflow" business seems to be driven simply as a means to avoid tapping your thumbs waiting for the infernal robot to finish something. If things come back speedy quick all the time, it should keep up with the "speed of the human" and let me stay focused on one thread instead of half a dozen. Plus the cost of screwing up gets significantly lower because you just re-fire with an adjusted prompt and iterate.

Someday these things will be 100x as fast as they are today and that is when things will get insane.

TacticalCoder • yesterday at 11:22 PM

> I usually can't beat an AI agent harness at this task today. If the AI model is 3x faster I have less of a chance.

Yes: we have these new tools that are extremely good at helping us search through our codebases. Not just to find where/how functionalities are implemented: IMO bug searching is even way more powerful.

But: why would you want to compete with AI to do that? I cannot compete with grep/ripgrep... And I'm cool with that.

This lets you focus more on the more interesting parts, where AI/LLMs suck fat balls.

ai_fry_ur_brain • yesterday at 9:15 PM

From what I know about batch processing/concurrency in inference, this is a pipe dream... Or it's going to cost an arm and a leg. I think they're lying, or it's going to be a much smaller model and not "frontier".
