It's wild that Sonnet 4.6 is roughly as capable as Opus 4.5 - at least according to Anthropic's benchmarks. It will be interesting to see if that's the case in real, practical, everyday use. The speed at which this stuff is improving is really remarkable; it feels like the breakneck pace of compute performance improvements of the 1990s.
> The speed at which this stuff is improving is really remarkable; it feels like the breakneck pace of compute performance improvements of the 1990s.
Yeah, but RAM prices are also back to 1990s levels.
simonw hasn't shown up yet, so here's my "Generate an SVG of a pelican riding a bicycle"
https://claude.ai/public/artifacts/67c13d9a-3d63-4598-88d0-5...
I sent Opus a photo of NYC at night satellite view and it was describing "blue skies and cliffs/shore line"... mistral did it better, specific use case but yeah. OpenAI was just like "you can't submit a photo by URL". Was going to try Gemini but kept bringing up vertexai. This is with Langchain
The system card even says that Sonnet 4.6 is better than Opus 4.6 in some cases: Office tasks and financial analysis.
We see the same with Google's Flash models. It's easier to make a small capable model when you have a large model to start from.
Given that users prefered it to Sonnet 4.5 "only" in 70% of the cases (according to their blog post) makes me highly doubt that this is representative of real-life usage. Benchmarks are just completely meaningless.
Why is it wild that a LLM is as capable as a previously released LLM?
The most exciting part isn't necessarily the ceiling raising though that's happening, but the floor rising while costs plummet. Getting Opus-level reasoning at Sonnet prices/latency is what actually unlocks agentic workflows. We are effectively getting the same intelligence unit for half the compute every 6-9 months.