So what are we to make of the two items:
- This tracker not showing any visible degradation.
- Clearly incorrect answers being reported due to truncated thinking.
Is the tracker not measuring 'simpler' tasks that might get auto-sent to "low reasoning hell" even on high/xhigh? Is the clustering not actually causing reasoning misses in real-life coding, or is its negative effect too small to show against the improvements made elsewhere? Something else?