Indeed, it looks like my work has suffered from the clustering issue as well:
reasoning_output_tokens   count   percent
━━━━━━━━━━━━━━━━━━━━━━━   ━━━━━   ━━━━━━━
0                         873     28.5948
8                         64      2.0963
9                         60      1.9653
11                        54      1.7688
516                       48      1.5722
12                        45      1.4740
10                        43      1.4085
17                        40      1.3102
13                        38      1.2447
14                        36      1.1792
I created a script for this: https://github.com/thehappybug/codex-reasoning-token-check
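For anyone who wants to reproduce this kind of tally by hand, here is a minimal sketch of the counting step. It is not the code from the linked repository: it assumes the reasoning_output_tokens values have already been extracted into a plain list of integers, and the function name and sample values are hypothetical.

```python
from collections import Counter


def token_count_distribution(values, top=10):
    """Return (value, count, percent) rows for the most frequent values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return []
    # most_common() sorts by count, highest first
    return [(value, n, 100.0 * n / total) for value, n in counts.most_common(top)]


# Hypothetical sample data, not taken from real sessions
sample = [0, 0, 0, 516, 8, 8, 9, 1200, 516, 0]

for value, n, pct in token_count_distribution(sample):
    print(f"{value:<25} {n:<7} {pct:.4f}")
```

A spike at one specific non-zero value (such as 516 in my table) stands out in this view because it ranks alongside the very small counts instead of blending into a long tail.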
When I reviewed the conversations affected by this issue, they did not always match the ones where I had felt the output was degraded.
Some were definitely below par, and I recall having to iterate on the generated code more than I wanted to. However, that was true for only a very small number of conversations.
So we're looking at a small set of affected conversations, and even within that set, only a few have noticeably degraded output, likely because the model can compensate for the reasoning defect over the course of a long conversation.