ohadron • yesterday at 6:21 PM • 6 replies

This is great. Agentic coding at 600+ tokens/sec is going to be a radically different beast. Coming soon-ish?

Replies

For small enough tasks with tight enough workflows, you can have it right now. I.e., if you can constrain the task to work well with GPT OSS 120B, Llama 3.3, or Qwen 3, then you can get upwards of 600 TPS on Groq and up to 3k TPS on Cerebras.

Those models aren’t comparable to Opus, or even weaker models like MiniMax, but for certain tasks (focused context and prompts, strict workflows, single-purpose requests) you absolutely can use these models and get insane speeds.
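
To put those speeds in perspective, here is a rough back-of-the-envelope sketch. The 600 and 3,000 TPS figures are the ones quoted above; the 3,000-token step size and the 60 TPS baseline are assumptions picked purely for illustration, and the calculation ignores prompt processing and network latency.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady decode rate.

    Ignores prompt processing, queueing, and network latency.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Hypothetical agent step size (an assumption, not a measurement).
STEP_TOKENS = 3_000

# 600 and 3,000 TPS come from the comment above; 60 TPS is an assumed baseline.
RATES = [("assumed baseline", 60), ("Groq figure", 600), ("Cerebras figure", 3_000)]

for label, tps in RATES:
    seconds = generation_seconds(STEP_TOKENS, tps)
    print(f"{label:>16}: {seconds:5.1f} s per step")
```

Under those assumptions a single step drops from roughly 50 seconds to 5 seconds or 1 second, which is the difference between waiting on the agent and the agent waiting on you.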

black_knight • yesterday at 6:59 PM

People seem to use these tools very differently from each other. I value intelligence over speed any day. My programs are written in Haskell, so there are rarely any tasks which require thousands and thousands of lines to solve. Just intelligence. If there are rote tasks, I want the LLM to help me find intelligent ways of automating them: the right abstraction, the right meta-programming technique.

I constantly push Opus and GPT, and they are getting better. But I still have to do the hardest parts myself. I would not mind waiting 10-15 minutes for the right 20 lines of code!

tekacs • yesterday at 7:04 PM

Google's 3.5 Flash – which came out yesterday – is 200-300 tokens/second (albeit purportedly inefficient in its use of reasoning tokens) and according to Google, 800-1500+ tokens/second on their 8i TPUs when they're out!

It's... suboptimal, but hopefully that's a reason to hope... if Google get themselves together for 3.5 Pro / the next Flash.

c7b • yesterday at 6:48 PM

Do you have ideas/suggestions for agentic workflows that only start making sense at such speeds?

8note • yesterday at 6:36 PM

I really want a Qwen on one of these chips: https://chatjimmy.ai

15k tokens/s would get me feeling like it's actually worth splitting out worktrees to try several approaches to a problem.

philipp-gayret • yesterday at 6:29 PM

If you have a Cerebras Code subscription you can experience it right now. Indeed, a very different experience.
