If you're going to run this again I also recommend encouraging the model to provide its rationale and then having it return the true/false/misleading/mostly-true/abstain at the end of its response.
Models give much better answers when they can "think out loud" before answering, and storing that rationale will make it easier to understand why they picked different answers for ambiguous questions.
This is a good pattern because it would allow all the models to "think" a bit before giving an answer even if they don't have reasoning or thinking turn on. Just make sure you have the reasoning output before the final answer. A mistake I see all the time is having the answer outputted first then the explanation after which leaves more room for models to rationalize bad answers.
Good pattern: {"explanation": <short explanation for your answer>, "answer": <your final answer: true|false|i don't know>}
Bad pattern: {"answer": <your answer here>, "explanation": <short explanation for your answer>}
Good point. Processing the substance of the answer might be too labor-consuming (1,000 claims x 5 models), but "thinking out loud" might improve the quality of the answers indeed. And we can still force/ask them to respond with a clear verdict at the end of their reasoning, as per the chosen rubric.
FWIW I built a text classification tool for internal use using (at this point 1 year old) frontier models and found that asking for reasoning significantly increased precision and recall.