The problem with "context rot" is that its existence and severity are purely anecdotal. As far as I know, nobody has actually measured context rot systematically. The only thing we know is that recall degrades somewhat in long contexts, via things like needle-in-a-haystack tests. But that's not the same issue. Context rot is usually taken to mean that the model gets dumber even when it doesn't need to remember specific things in its context window.
This would be really easy to measure. Just take some standard benchmarks, but fill up the context beforehand. Is the benchmark performance degraded? If so, by how much?
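A minimal sketch of that experiment in Python, just to make it concrete. Everything here is an assumption for illustration: `Model`, `accuracy`, and `rot_delta` are hypothetical names, the model is taken to be any function from prompt string to answer string, and scoring is plain exact match.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a model is any callable from prompt text to answer text.
Model = Callable[[str], str]
Item = Tuple[str, str]  # (question, expected answer)


def accuracy(model: Model, items: Sequence[Item], prefix: str = "") -> float:
    """Exact-match accuracy, optionally with filler text placed before each question."""
    if not items:
        raise ValueError("benchmark has no items")
    correct = 0
    for question, expected in items:
        prompt = f"{prefix}\n\n{question}" if prefix else question
        if model(prompt).strip() == expected:
            correct += 1
    return correct / len(items)


def rot_delta(model: Model, items: Sequence[Item], filler: str) -> float:
    """Accuracy lost when the context is filled up first (positive means worse)."""
    baseline = accuracy(model, items)
    padded = accuracy(model, items, prefix=filler)
    return baseline - padded
```

Running `rot_delta` over several filler lengths would give a degradation curve rather than a single number.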
It's pretty hard to measure, because most context rot comes from related context: the model has to figure out which parts are truly relevant, which are relevant but stale, which to ignore, etc.
Each relevant thing is basically a rule. Trying to do something with 500 rules is what's hard.
If you take a standard benchmark and just prepend a random book to it, it will not capture that.
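To make the contrast with random filler concrete, here is a rough Python sketch of a context where every line is related and some lines are stale. It is a toy under stated assumptions, not a real workload: `build_rule_context` and `question_for` are hypothetical names, the "rules" are synthetic key/value settings, and the only thing modeled is later rules overriding earlier ones.

```python
import random
from typing import Dict, Tuple


def build_rule_context(n_rules: int, n_keys: int, seed: int = 0) -> Tuple[str, Dict[str, int]]:
    """Generate n_rules related rules; a later rule for a key makes earlier ones stale.

    Returns the context text and the value currently in force for each key seen.
    """
    rng = random.Random(seed)  # seeded so the same context can be regenerated
    keys = [f"setting_{i}" for i in range(n_keys)]
    current: Dict[str, int] = {}
    lines = []
    for step in range(n_rules):
        key = rng.choice(keys)
        value = rng.randint(0, 99)
        current[key] = value  # overwrite: the older rule for this key is now stale
        lines.append(f"Rule {step + 1}: {key} is now {value}.")
    return "\n".join(lines), current


def question_for(key: str, current: Dict[str, int]) -> Tuple[str, str]:
    """Build a question whose correct answer needs the newest rule for that key."""
    return f"What is the current value of {key}?", str(current[key])
```

With something like `build_rule_context(500, 20)`, every line looks relevant to the question, but only the last mention of the key counts, which is the situation a prepended book never creates.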