we need some better standard long-context benchmarks.
needle in a haystack is not good for this, yes it proves the model can attend to its context, but in its usual form, somewhat trivializes the query-key relationship.
something like long-form Q&A would be more ideal. Like reading a book and answering questions that require synthesizing information derived from either the whole thing or disparate portions of it. Like describing an entire character arc in a 1000 page novel with examples and evidential moments.