alt Hacker News

UltraSane • yesterday at 10:09 PM • 0 replies

https://www.linkedin.com/pulse/announcing-aa-briefcase-bench...

AA-Briefcase is a new benchmark for testing models on realistic knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week knowledge work projects, each with many linked tasks and thousands of input source files. AA-Briefcase combines rubric and pairwise grading to evaluate verifiable task success, analytical quality, and presentation quality, giving a holistic view of overall agentic capability in knowledge work.

Tasks with many messy input files, conflicting information, and complex deliverables remain difficult for all models. Under a strict all-or-nothing grading scheme per task, Claude Fable 5 leads overall, but achieves a perfect task score on only 3% of tasks. On 31 of 91 tasks, no model scores above 50%.
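For scale, 31 of 91 is roughly 34% of tasks. To see why all-or-nothing grading pushes scores down so sharply, here is a minimal sketch comparing strict and partial-credit scoring. It assumes a task is graded against a list of pass/fail rubric criteria, which may not match how AA-Briefcase actually structures its rubrics; the function names and the three example tasks are hypothetical, made up for illustration, and are not benchmark data.

```python
from typing import Dict, List


def partial_score(criteria: List[bool]) -> float:
    """Partial credit: fraction of rubric criteria that were met."""
    if not criteria:
        raise ValueError("a task needs at least one criterion")
    return sum(criteria) / len(criteria)


def strict_score(criteria: List[bool]) -> float:
    """All-or-nothing: full credit only if every criterion was met."""
    if not criteria:
        raise ValueError("a task needs at least one criterion")
    return 1.0 if all(criteria) else 0.0


# Hypothetical rubric outcomes for three tasks (illustrative only).
tasks: Dict[str, List[bool]] = {
    "task_a": [True, True, True, True],     # everything met
    "task_b": [True, True, True, False],    # one miss
    "task_c": [True, False, False, False],  # mostly missed
}

partial = {name: partial_score(c) for name, c in tasks.items()}
strict = {name: strict_score(c) for name, c in tasks.items()}

mean_partial = sum(partial.values()) / len(partial)
mean_strict = sum(strict.values()) / len(strict)

print(f"mean partial-credit score: {mean_partial:.2f}")
print(f"mean all-or-nothing score: {mean_strict:.2f}")
```

In this toy setup, a single missed criterion takes task_b from 0.75 to 0 under the strict scheme, so a model can do most of the work on most tasks and still post a very low perfect-task rate.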