Easily the most interesting part of this announcement is buried in the second to last paragraph:
"We're also launching GPT‑5.6 Sol on Cerebras at up to 750 tokens per second in July, bringing frontier intelligence to customers at unprecedented speed. Access will initially be limited to select customers as we expand capacity."
750 tokens/s on a frontier model is going to be extremely interesting. I doubt this new version is anything but a version bump in terms of capabilities but if we can start getting these answers back faster, they end up being more useful.
Just off the top of my head, I can think of the tedious task of finding certain functionality within a codebase. I usually can't beat an AI agent harness at this task today. If the AI model is 3x faster I have less of a chance.
For comparison, OpenRouter says Opus 4.8 is ~55 tokens/s and fast mode is ~102.
750 tokens/s for their largest model is going to be nuts.
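To put those rates side by side, here is a rough sketch of pure generation time for one answer. The three rates are the ones quoted in this thread; the 2,000-token answer length is my own assumption, and time-to-first-token and network latency are ignored.

```python
# Rates quoted in the thread, in tokens per second.
RATES_TPS = {
    "Opus 4.8": 55,
    "Opus 4.8 fast mode": 102,
    "GPT-5.6 Sol on Cerebras (up to)": 750,
}
RESPONSE_TOKENS = 2_000  # assumption: a hypothetical answer length


def seconds_to_generate(tokens: int, tokens_per_second: float) -> float:
    """Decode time only; ignores time-to-first-token and network latency."""
    return tokens / tokens_per_second


for name, tps in RATES_TPS.items():
    print(f"{name}: {seconds_to_generate(RESPONSE_TOKENS, tps):.1f} s")
```

Under those assumptions the same answer goes from a bit over half a minute to under three seconds, which is the difference between waiting and not noticing.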
> I can think of the tedious task of finding certain functionality within a codebase. I usually can't beat an AI agent harness at this task today.
Yup, I remember "racing" the AIs to figure things out in codebases just a year ago. Today, I have no chance. Whether it is due to degraded reasoning capabilities on my part or better models, I don't know.
I'm skeptical of what "up to" 750t/s really means. Maybe if they make it extremely expensive so it frees up enough capacity?
GPT‑5.3‑Codex‑Spark currently runs on Cerebras chips and it's giving me around 150t/s. Still relatively very fast, but nowhere near the 1,000t/s they claimed at launch. (Also it's not a very good model.)
That said, I'm super bought in to faster models being better for most use cases than smarter models.
I saw videos of coding with Mimo-V2.5-Pro UltraSpeed, which is advertised at 1,000 tokens/s. Very impressive:
https://www.bilibili.com/video/BV1fME16uEW7
If the time-to-first-token latency also greatly improves, this could be very useful for end-to-end control, like autonomous driving for example.
Yep, this is a glimpse into the future of 500+ t/s, which is in my opinion the next big thing that validates the Jevons paradox (the models are already smart enough).
At a certain rate we will be able to move towards continuous / real-time inference systems. The discrete, turn based solutions are quite confining with how they must be trained. Continuous and real-time would fundamentally alter the domain.
From an information theory perspective we are still in dial-up territory with regard to the actual information rate. 750 tokens per second would be a really bad dial-up connection. Imagine 10 million tokens per second.
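A back-of-the-envelope check of the dial-up comparison. The vocabulary size here is an assumption (I don't know the real tokenizer), and log2(vocab) is only an upper bound on bits per token, since real token streams are far from uniformly distributed.

```python
import math

VOCAB_SIZE = 200_000  # assumption: hypothetical tokenizer vocabulary size
DIALUP_BPS = 56_000   # nominal rate of a 56k modem, in bits per second


def max_bits_per_second(tokens_per_second: float, vocab_size: int = VOCAB_SIZE) -> float:
    """Upper bound on the information rate: each token carries at most log2(vocab) bits."""
    return tokens_per_second * math.log2(vocab_size)


for tps in (750, 10_000_000):
    bps = max_bits_per_second(tps)
    print(f"{tps:>10,} tokens/s <= {bps / 1000:,.1f} kbit/s "
          f"({bps / DIALUP_BPS:.2f}x a 56k modem)")
```

With that assumed vocabulary, 750 tokens/s tops out at roughly 13 kbit/s, well under a 56k modem, while 10 million tokens/s would be on the order of 176 Mbit/s.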
Bear in mind that "GPT‑5.6 Sol on Cerebras at up to 750 tokens per second" does not necessarily mean the same model (in terms of inference result). It can mean anything, like a very quantized model, a different level of model activation per inference, etc.
Of course we can trust that they wouldn't name the same thing with different levels of intelligence, right? Right?
I still use GPT-5.3-codex-spark which also runs on the Cerebras chips. Spark can run at >1000 tok/s but it's highly limited in its context window size so it's not suitable for many workflows.
Granted this will be a bit slower (relatively speaking) but it will still be awesome.
> second to last
There's a word for this that you should never pass up an opportunity to use: penultimate. (You should also never pass up the opportunity to use "defenestrate," but it sadly does not apply here.)
This is a strange one. We know the hardware capabilities of Cerebras force them to do aggressive REAP pruning to serve Kimi K2.6. Meaning that about 750B parameters is the upper limit of what they can serve economically. Not sure if this means Sol is smaller than anyone thinks or that they're just going to charge so much that a very inefficient serving regime is feasible.
This is something Xiaomi already did with MiMo-2.5-Pro a month ago, and at a higher speed (1,000 t/s).
750 tps at GPT-5.5-Pro prices would be ruinous!
Last I heard, Cerebras chips were entire wafers and would be extremely expensive. How could OpenAI possibly have enough of these to serve a popular model at scale?
Cerebras is Milli Vanilli. They spent 10 years burning cash on a failed idea (which is frankly insane, since they should have figured out the limitations of their stack in like... a weekend) and struck accidental gold with their 'Giant ass wafer'.
The company is valued like they broke open the grail, when in reality it's more like they bought a Cybertruck, got it stuck in the mud, and realized "You know what this thing does better than all other cars... shovel mud"
I'm shorting Cerebras with margin to virtually zero.
This would be amazing for some of our "real-time" workflows that need to fall back to AI for one reason or another. What used to happen is a rules-based system did the majority of the work, and the occasional corner case would fall back to humans. Then we moved AI in, still not real time, but much faster. Cerebras could make that even faster.
OpenAI also announced two days ago that they're starting to make Cerebras-style chips themselves [0]. It will be interesting to see how fast SotA model inference will be by the end of the year.
[0]: https://openai.com/index/openai-broadcom-jalapeno-inference-...
It all depends on the context window size. A small context size with fast performance won't be very useful today, as most workloads (like requests behind codex) usually have very long context.
At thousands of tokens per second, LLMs (harnesses) can start to do a broader tree search of possibilities even in inefficient token space. This unlocks capabilities outside programming.
The speed sounds great; faster models make that gap much more visible.
3x faster burn on top of 3x more expensive tokens: generate more tokens, pay more fees.
this means they also earn at a faster rate in some setups :)
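One way to read the comment above is "tokens that cost 3x more, burned 3x faster". Here is a quick sketch of what that does to spend per hour of continuous generation. Every number below is a made-up placeholder, not a real price or a measured rate.

```python
def dollars_per_hour(tokens_per_second: float, usd_per_million_tokens: float) -> float:
    """Spend for one hour of non-stop output at a given rate and price."""
    tokens_per_hour = tokens_per_second * 3600
    return tokens_per_hour * usd_per_million_tokens / 1_000_000


# Hypothetical baseline: 250 tokens/s at $10 per million output tokens.
baseline = dollars_per_hour(250, 10)
# Hypothetical fast tier: 3x the speed at 3x the price.
fast = dollars_per_hour(750, 30)

print(f"baseline: ${baseline:.2f}/h, fast: ${fast:.2f}/h, ratio: {fast / baseline:.0f}x")
```

The two multipliers compound: 3x speed at 3x price is 9x the hourly spend if the stream is kept busy, though the cost per finished task only goes up by the price multiplier if the task needs the same number of tokens.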
Does the Cerebras variant offer input caching and corresponding discounts? Last I checked Cerebras would not cache or would cache but not give discounts for the cached input, making it impractical for agentic use and multiturn conversations.
"we can start getting these answers back faster, they end up being more useful."
Dude, 10x token speed is going to be absolutely nuts. Half the "parallel subagent workflow" business seems to be driven simply as a means to avoid tapping your thumbs waiting for the infernal robot to finish something. If things come back speedy quick all the time, it should keep up with the "speed of the human" and let me stay focused on one thread instead of half a dozen. Plus the cost of screwing up gets significantly lower because you just re-fire with an adjusted prompt and iterate.
Someday these things will be 100x as fast as they are today and that is when things will get insane.
> I usually can't beat an AI agent harness at this task today. If the AI model is 3x faster I have less of a chance.
Yes: we have these new tools that are extremely good at helping us search through our codebases. Not just to find where/how functionalities are implemented: IMO bug searching is even way more powerful.
But: why would you want to compete with AI to do that? I cannot compete with grep/ripgrep... And I'm cool with that.
This lets you focus more on the more interesting parts, where AI/LLMs suck fat balls.
From what I know about batch processing/concurrency in inference, this is a pipe dream... Or it's going to cost an arm and a leg. I think they're lying, or it's going to be a much smaller model and not "frontier".
https://mikeveerman.github.io/tokenspeed/?rate=750&mode=thin...
This is what 750tps looks like, I guess.