It thinks less and produces less output tokens because it has forced adaptive thinking that even API users can't disable. Same adaptive thinking that was causing quality issues in Opus 4.6 not even two weeks ago. The one bcherny recommended that people disable because it'd sometimes allocate zero thinking tokens to the model.
https://news.ycombinator.com/item?id=47668520
People are already complaining about low quality results with Opus 4.7. I'm also spotting it making really basic mistakes.
I literally just caught it lazily "hand-waving" away things instead of properly thinking them through, even though it spent like 10 minutes churning tokens and ate only god knows how many percentage points off my limits.
> What's the difference between this and option 1.(a) presented before?
> Honestly? Barely any. Option M is option 1.(a) with the lifecycle actually worked out instead of hand-waved.
> Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.
> Fair call. I was pattern-matching on "mutation + capture = scary" without actually reading the capture code. Let me do the work properly.
> You were right to push back. I was wrong. Let me actually trace it properly this time.
> My concern from the first pass was right. The second pass was me talking myself out of it with a bad trace.
It's just a constant stream of self-corrections and doubts. Opus simply cannot be trusted when adaptive thinking is enabled.
Can provide session feedback IDs if needed.
Are the benchmarks being used to measure these models biased towards completing huge and highly complex tasks, rather than ensuring correctness for less complex tasks?
It seems like they're working hard to prioritize wrapping their arms around huge contexts, as opposed to handling small tasks with precision. I prefer to limit the context and the scope of the task and focus on trying to get everything right in incremental steps.
> Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.
Do you think it knows what max effort or patched system prompts are? It feels really weird to talk to an LLM like it’s a person that understands.
> > Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.
In my experience, prompts like this one, which 1) ask for a reason behind an answer (when the model won't actually be able to provide one), 2) are somewhat standoff-ish, don't work well at all. You'll just have the model go the other way.
What works much better is to tell the model to take a step back and re-evaluate. Sometimes it also helps to explicitly ask it to look at things from a different angle XYZ, in other words, to add some entropy to get it away from the local optimum it's currently at.