Hacker News

ipunchghosts today at 12:43 PM

I think people only care about how Claude or Codex does.


Replies

kostaj today at 1:39 PM

GPT-5.4 and Opus 4.7, specifically, agree with each other on 65% of the claims (95% CI 62–68%). I.e., on at least 35% of the claims, at least one of the two models is wrong under this 4-bucket rubric.
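
For anyone who wants to check this kind of number themselves, here is a rough sketch of computing a pairwise agreement rate with a 95% interval. The function name, the bucket labels, and the sample data are all made up for illustration, and the normal-approximation (Wald) interval is my assumption, not necessarily the method behind the 62–68% figure.

```python
import math


def agreement_with_ci(labels_a, labels_b, z=1.96):
    """Return (agreement rate, CI lower bound, CI upper bound).

    Uses a normal-approximation (Wald) interval, which is an assumption
    made for this sketch; other interval methods would give slightly
    different bounds, especially for small samples.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Count the claims where both models picked the same bucket.
    agree = sum(a == b for a, b in zip(labels_a, labels_b))
    p = agree / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1] because the Wald interval can overshoot.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical 4-bucket labels for eight claims (not real data).
model_a = ["true", "false", "mixed", "unverifiable", "true", "false", "true", "mixed"]
model_b = ["true", "false", "true", "unverifiable", "false", "false", "true", "mixed"]

p, lo, hi = agreement_with_ci(model_a, model_b)
print(f"agreement={p:.2f}, 95% CI=({lo:.2f}, {hi:.2f}), disagreement={1 - p:.2f}")
```

Whenever the two labels differ, at least one model has to be wrong, so `1 - p` is a lower bound on the share of claims where at least one of them erred. They can also agree and both be wrong, which is why it is only a lower bound.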

spprashant today at 12:48 PM

Looks like they land at the average number of 67% disagreement.

airstrike today at 12:44 PM

I agree, but the market is pricing way beyond that.