gm678 • today at 4:34 PM • 6 replies

I don't know what the Y-axis is supposed to be on that Wharton AI capabilities graph, but I am not really convinced that Opus 4.6 has more than double the intelligence/capability/whatever of GPT 5.1 Max.

Replies

NitpickLawyer • today at 4:37 PM

IIRC that graph tracks capabilities as time_to_solve a task for humans (i.e. the model can now handle tasks that usually take a human ~8h). Which, depending on what tasks you look at, could be a reasonable finding. I could see Opus 4.6 handling tasks that take ~8h for humans, and that 5.1 couldn't previously handle (with 5.1 being "limited" at 4h tasks let's say). It is a bit arbitrary, but I think this is what they're tracking.
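To make that reading concrete, here is a minimal sketch of the arithmetic, assuming the Y-axis is a task time horizon in human hours. The 4h and 8h figures are the illustrative numbers from this comment, not measured results, and `doublings` is a hypothetical helper name.

```python
import math


def doublings(old_horizon_hours: float, new_horizon_hours: float) -> float:
    """Return how many times the horizon doubled (log2 of the ratio)."""
    if old_horizon_hours <= 0 or new_horizon_hours <= 0:
        raise ValueError("horizons must be positive")
    return math.log2(new_horizon_hours / old_horizon_hours)


# Illustrative values only: "limited at 4h tasks" vs "handles ~8h tasks".
old_h, new_h = 4.0, 8.0
ratio = new_h / old_h

# On a linear axis the newer model sits twice as high; on a log2 axis
# the same gap is a single step.
print(f"linear-axis ratio: {ratio:.1f}x")
print(f"log2-axis doublings: {doublings(old_h, new_h):.1f}")
```

Under this assumption, "double" on the chart means tasks twice as long, not twice the intelligence.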


strken • today at 5:12 PM

Check out Re-Bench and HCAST.

The tasks are obviously all of the form "Go do this, and if you get the following output you passed". Setting up a web server apparently takes 15 minutes for a human, which is news to me since I'm able to search for https://gist.github.com/willurd/5720255, find the python one-liner, and copy it within about ten seconds.
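For reference, the one-liner is presumably something like `python3 -m http.server 8000`. Below is a minimal sketch of the same idea using the standard library directly; `start_static_server` is a hypothetical helper name, and binding to port 0 (let the OS pick a free port) is an assumption made so the example cleans up after itself.

```python
import threading
import urllib.request
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler


def start_static_server(directory: str = ".", port: int = 0) -> HTTPServer:
    """Serve `directory` over HTTP on localhost in a background thread."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    server = HTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Start the server, fetch the root page once, then shut down cleanly.
server = start_static_server()
port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/") as resp:
    status = resp.status
server.shutdown()
server.server_close()
print(status)
```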

Anyway, this is cool but it does not mean Claude can perform any human tasks that take less than 8 hours and are within its physical capabilities.

throwaway27448 • today at 5:20 PM

> more than double the intelligence/capability/whatever

I'm curious what people really mean when they say this. Intelligence is famously hard to define, let alone measure; it certainly doesn't scale linearly; it only loosely correlates with real-world qualities that are easy to measure; etc. Are you referring to coding ability or...?

adw • today at 5:30 PM

https://podcasts.apple.com/us/podcast/machine-learning-stree... is a pretty good primer on METR, what it measures, and its limitations.

myhf • today at 5:07 PM

According to this article: whenever someone games a benchmark to make an upward chart on some y-axis, it's YOUR responsibility to prove how and why that trend can't continue indefinitely.

🙄


BoredPositron • today at 4:36 PM

See https://metr.org/time-horizons/ on a linear scale. Clickbait garbage article, like most of his in the last year.
