bthornbury • today at 9:10 PM

we need some better standard long-context benchmarks.

needle in a haystack is not good for this. yes, it proves the model can attend to its context, but in its usual form it somewhat trivializes the query-key relationship.
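
to make that concrete, here's a rough sketch of how a typical needle test gets built. this is my own toy construction, not any particular benchmark's code, and `build_niah_prompt` and `lexical_overlap` are made-up helper names. the point it tries to show: the question shares almost all of its words with the needle and none with the filler, so finding the answer is close to a string match.

```python
import random


def build_niah_prompt(filler_sentences, needle, question, depth, seed=0):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of shuffled filler."""
    rng = random.Random(seed)
    haystack = list(filler_sentences)
    rng.shuffle(haystack)
    position = round(depth * len(haystack))
    haystack.insert(position, needle)
    context = " ".join(haystack)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def lexical_overlap(question, sentence):
    """Fraction of distinct question words that also appear in the sentence."""
    q = set(question.lower().replace("?", "").split())
    s = set(sentence.lower().replace(".", "").split())
    return len(q & s) / len(q)


# Toy data, invented for illustration only.
filler = [f"Filler sentence number {i} about nothing in particular." for i in range(1000)]
needle = "The secret passcode for the vault is 7421."
question = "What is the secret passcode for the vault?"

prompt = build_niah_prompt(filler, needle, question, depth=0.5)
needle_overlap = lexical_overlap(question, needle)
max_filler_overlap = max(lexical_overlap(question, f) for f in filler)

print(f"overlap with needle: {needle_overlap:.2f}")
print(f"max overlap with filler: {max_filler_overlap:.2f}")
```

word overlap is only a crude proxy for what attention actually does, but it gives a feel for why this setup is easy: nothing else in the context looks remotely like the question.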

something like long-form Q&A would be closer to ideal: reading a book and answering questions that require synthesizing information from the whole thing or from disparate portions of it. for example, describing an entire character arc in a 1000-page novel, with examples and the specific moments that support it.
