microtonal • today at 6:12 PM • 17 replies

> Claude Sonnet 5 is built to be the most agentic Sonnet model yet. It can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models.

I have been using Sonnet 4.6 more than Opus, because I'm mostly doing agent-assisted development and not fully agent-driven development. This announcement does not make me optimistic: I have found that the more models are optimized for fully agentic development, the worse they get at assisted development, and they often start doing too much despite very strict, specific instructions.

I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.

Replies

Brendinooo • today at 6:42 PM

Yeah, there's a real opportunity for one of these companies to invest time in a model that's tuned for, to use your term, agent-assisted development.

Trouble is, everyone inside their buildings seems to believe that no one will be working like that in a year or two.

jerf • today at 7:24 PM

I've been using Kimi K2.6 lately (don't have 2.7 available through blessed work channels yet) for tasks where I already know what it is I want to do and I want to just step through the process in pieces, and it's fine. Do I have to correct it maybe a bit more than Opus? Yeah, but the real cutoff would be between "I have to read every line" and "I can just trust it without reading every line" and for me neither model hits that mark, and I expect it to be a while yet for that. Is it as good as Opus if I want to spit ball about architecture and then convert that to code? No, but I don't have that problem all the time, and it's there if I do need it.

And now in a heavy coding week rather than bumping up against my spend limit by late Wednesday or Thursday I'm comfortably below it all week.

That said, if anything I feel like I have to rein in K2.6 much more than Opus, actually. If I want to just ask it a question without it inferring some coding task to immediately start doing, it takes a lot more care to prevent it from just running off half-cocked off of an only 3/4s-cocked idea of my own. I use "plan" mode with both but it's somewhat more defensive with K2.6 than Opus.

nozzlegear • today at 7:05 PM

> I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.

I moved completely to local models that I run on my M1 Mac Studio (64 GB RAM) some time ago. But for the rare times when I feel the local, quantized Qwen3.6 isn't enough, I just connect to OpenRouter and use something like Kimi, GLM or DeepSeek for a fraction of the price of Anthropic et al.

m3h • today at 8:00 PM

I think you should try an OpenAI model like GPT 5.5. It is better at following the instructions and boundaries set in the prompt. It feels like a more capable "agent assistant" than Claude models, but without loss of intelligence.

Most of my work involves "Agentic engineering" instead of fire-and-forget. I like to stay involved during planning as well as review, and I ask the agent a lot more questions than I've seen others do. In a way, I'm using the agent in a sort of "hyper auto-complete" mode to fill in the blanks (rather big blanks) once I've set out the requirements, scope and design (sometimes specific module boundaries). This works best for me.

mark_l_watson • today at 8:26 PM

Good point, I also like to do the work myself, with an assistant under my control. I am usually really happy with DeepSeek v4 Flash that I feel just mostly does what I tell it to do, but I do switch to Pro for harder tasks.

There are so many models, and I personally ignore benchmarks so it takes some time to try different models on my use cases. Fortunately, it is ‘good enough’ to do the work to find a few models that work for me, and just use them for a month or two before re-investing time for my own evals to possibly change models.

People should evaluate what works for them and ignore other people and benchmarks. (Apologies if that sounds snarky.)

jklmnopqrstuvw • today at 6:19 PM

In my own experience, GLM-5.2 generally costs more tokens and is much slower.

duxup • today at 8:39 PM

“Hey I saw some messed up function commented out that at face value is a bad idea… so here it is again with some nonsense assumptions ….”

I ask “where did you get that?” … too often if I’m not constantly guiding it, and even then it still goes off the rails.

mattmatheus • today at 8:18 PM

I've been working to use the best model for the task for about 6 months and have found great success doing planning with the 'frontier' model but punting implementation down to a 'lesser' model. I'm using Beads-Rust (a Rust fork of GasTown's beads) as my issue tracker. So far, so good.

whateveracct • today at 7:00 PM

agent-assisted development uses orders of magnitude fewer tokens than agent-driven development

the incentives aren't there sadly

arikrahman • today at 8:16 PM

I have also started shifting to models more reasonable for my workflow. I've been using the Reasonix harness for Deepseek, and cache hits make the token use basically free. This is with unsubsidized models as well, using American providers.

bckr • today at 8:21 PM

I suggest encoding your invariants in the harness: architectural invariants that can be mechanically checked, including which modules are approved, which dependencies are allowed, etc.
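
As a rough illustration of what a mechanically checked invariant could look like, here is a minimal sketch of an import allowlist check in Python. The allowlist contents, the `myapp` package name, and the `disallowed_imports` helper are all hypothetical, chosen only for illustration; a real harness would wire a check like this into whatever hook it runs after the agent edits a file.

```python
import ast

# Hypothetical allowlist: top-level modules the project has approved.
APPROVED_MODULES = frozenset({"json", "pathlib", "myapp"})


def disallowed_imports(source: str, approved=APPROVED_MODULES) -> list[str]:
    """Return the sorted top-level modules imported by `source` that are not approved."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b.c` -> check the top-level package `a`.
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) stay inside the package; skip them.
            if node.level == 0 and node.module:
                found.add(node.module.split(".")[0])
    return sorted(found - set(approved))
```

A harness could call `disallowed_imports` on each file the agent touched and fail the step whenever the result is non-empty, which turns "don't add new dependencies" from a prompt instruction into a hard check.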

xpct • today at 6:23 PM

I've been largely disappointed by how much the Claude models ignore custom instructions, and sometimes even prompts in the chat interface. It sometimes feels like talking to a wall, or as if there was a third person in the chatroom whose messages I can't see.

I can't help but feel this is intentional towards the 'Agentic' workflow.

mohamedkoubaa • today at 6:12 PM

I've been moving more to Composer 2.5 for the same reason. KISS principle.

a_c • today at 7:29 PM

I actually use Sonnet 4.6 for my day-to-day coding too. It consumes far fewer tokens and is good enough. Opus is just too token-hungry to be useful to me.

epolanski • today at 6:40 PM

I've been saying for ages that since Opus 4.6, models are increasingly smart but increasingly unhelpful as assistants.

Fable was amazing as a vibecoder, but as an assistant it can't resist jumping into implementation and filling chats with pointless jargon.

It's really grim if you're looking for assistance instead of an implementor.

GPT 5.5 Pro and Fable are gorgeous bullshitters that pretend to be right (often convincingly because they are very smart) even when they are wrong and I need tons of energy to process their information.

I don't like it but don't know what to do. Anthropic models especially increasingly ignore instructions, whether in memory or agents files.

spullara • today at 8:38 PM

if you like that, use gpt models instead.

trollbridge • today at 7:42 PM

No kidding. I expect to have models available that are optimised for different use cases.

Sonnet as an autonomous agentic model is silly. We already have other models for that if you want something weaker and cheaper than Opus.
