Hacker News

virgildotcodes yesterday at 7:58 AM

I think you're speeding past the word "average" in the sentence. I'd argue that current frontier models already exceed the abilities of average humans across the majority of tasks you can do on a computer, although you might be able to argue that they tend to be a bit slower?

That latter part is debatable though - have you seen a non-technical person try to figure out something new on a computer?


Replies

UltraSane yesterday at 10:09 PM

https://www.linkedin.com/pulse/announcing-aa-briefcase-bench...

AA-Briefcase is a new benchmark for testing models on realistic knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week knowledge work projects, each with many linked tasks and thousands of input source files. AA-Briefcase combines rubric and pairwise grading to evaluate verifiable task success, analytical quality, and presentation quality, giving a holistic view of overall agentic capability in knowledge work.

Tasks with many messy input files, conflicting information, and complex deliverables remain difficult for all models. Under a strict all-or-nothing grading scheme per task, Claude Fable 5 leads overall, but achieves a perfect task score on only 3% of tasks. On 31 of 91 tasks, no model scores above 50%.

UltraSane yesterday at 8:23 AM

" I'd argue that current frontier models already exceed the abilities of average humans " for things that fit in their context window sure but LLMs can't learn over time the way humans can. One example is LLMs are very good at writing a few thousands line of code but they absolutely cannot write coherent million line codebases. By average human I meant the average skill level for the job. AGI would need to be able to pass a interview and get hired and the perform well enough to not get fired.
