I dislike the non-specificity of "models" here. Different models have different attention architectures and can therefore differ significantly in long-context behavior. It's true that long context is an issue and that most models do drop off in quality, but I would not extrapolate the behavior of old models to new ones.