Hacker News

CuriouslyC today at 3:47 PM

Not hard to understand what's going on here. They RL'd around patterns in their data and specific capabilities, so of course they'd construct a benchmark that's aligned with the training set.

Ironically, their benchmark might be more accurate than Artificial Analysis for the narrow slice of things that Cursor's Eigencustomer is really interested in. Otherwise I'd take it as just another data point.


Replies

leerob today at 4:58 PM

(I work at Cursor) CursorBench includes many evals drawn from actual engineering tasks from the Cursor team, including tasks in our private codebase. This codebase is held out from training, so models, including Composer, haven't seen it.