Hacker News

m3h, today at 4:32 PM

Indeed, it looks like my work has suffered from the clustering issue as well:

  reasoning_output_tokens    count    percent
  ━━━━━━━━━━━━━━━━━━━━━━━━━  ━━━━━━━  ━━━━━━━━━
                         0      873    28.5948
  ─────────────────────────  ───────  ─────────
                         8       64     2.0963
  ─────────────────────────  ───────  ─────────
                         9       60     1.9653
  ─────────────────────────  ───────  ─────────
                        11       54     1.7688
  ─────────────────────────  ───────  ─────────
                       516       48     1.5722
  ─────────────────────────  ───────  ─────────
                        12       45     1.4740
  ─────────────────────────  ───────  ─────────
                        10       43     1.4085
  ─────────────────────────  ───────  ─────────
                        17       40     1.3102
  ─────────────────────────  ───────  ─────────
                        13       38     1.2447
  ─────────────────────────  ───────  ─────────
                        14       36     1.1792

Created a script for this: https://github.com/thehappybug/codex-reasoning-token-check
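
For readers who want to see how a table like the one above could be produced, here is a minimal sketch. It is not the linked script, and I have not reproduced its logic. It assumes (hypothetically) that the session logs are JSON Lines and that each usage record stores an integer under a `reasoning_output_tokens` key somewhere in the document; the real log layout may differ. The helper names `iter_reasoning_tokens` and `token_histogram` are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def iter_reasoning_tokens(lines: Iterable[str]) -> Iterator[int]:
    """Yield every integer found under a "reasoning_output_tokens" key.

    Assumption: each line is one JSON document and the key may sit at any
    nesting depth. This is a guess at the log layout, not a documented format.
    """
    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                is_int = isinstance(value, int) and not isinstance(value, bool)
                if key == "reasoning_output_tokens" and is_int:
                    yield value
                else:
                    yield from walk(value)
        elif isinstance(node, list):
            for item in node:
                yield from walk(item)

    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield from walk(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON


def token_histogram(values: Iterable[int], top: int = 10) -> List[Tuple[int, int, float]]:
    """Return (value, count, percent) rows, most frequent value first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (value, count, round(100.0 * count / total, 4))
        for value, count in counts.most_common(top)
    ]
```

To use it, pass an open log file to `iter_reasoning_tokens` and feed the result to `token_histogram`. A large share of rows at 0, or a pile-up on a handful of specific values, is the kind of clustering shown in the table.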

Replies

m3h, today at 4:45 PM

When I reviewed the conversations affected by this issue, they did not always align with my feeling of "degraded output".

Some were definitely below par, and I recall having to iterate on the generated code more than I wanted to. However, that was true of only a very small number of conversations.

So we're looking at a small set of affected conversations, and even within that small set, only a few have degraded output, likely because the model can compensate for the reasoning defect over the course of a long conversation.
