
kostaj, today at 1:32 PM

Yes, disagreement between human annotators is also high on similar types of questions (AVeriTeC), where inter-panel agreement is κ=0.619. I tried giving the models a fifth option, Abstain, but some models seem to use it to "avoid answering hard questions" more than others.
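For anyone who wants a feel for what a κ like that measures, here is a minimal sketch of Cohen's kappa for two annotators. I'm assuming the two-rater Cohen variant purely for illustration (it may not be how the AVeriTeC inter-panel figure was computed), and the function name and labels below are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Illustrative helper, not taken from any benchmark's codebase.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label if each
    # annotator labelled independently according to their own frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; kappa is
        # undefined here, so treat it as perfect agreement.
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical labels, including an "abstain" option.
annotator_1 = ["supported", "supported", "refuted", "refuted"]
annotator_2 = ["supported", "refuted", "refuted", "refuted"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

The point of the chance correction is that raw percent agreement looks inflated when one label dominates. If a model leans heavily on Abstain, that shifts its label frequencies, so it is probably worth looking at the abstain rate per model alongside any agreement number.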