CuriouslyC • today at 3:47 PM

Not hard to understand what's going on here. They RL'd around patterns in their data and specific capabilities, so of course they'd construct a benchmark that's aligned with the training set.

Ironically, their benchmark might be more accurate than Artificial Analysis for the narrow slice of tasks that Cursor's Eigencustomer really cares about. Otherwise, I'd treat it as just another data point.

Replies

leerob • today at 4:58 PM

(I work at Cursor) CursorBench includes many evals drawn from real engineering tasks done by the Cursor team, including tasks in our private codebase. That codebase is held out from training, so no model has seen it, Composer included.