My taxes are rather complex, so I ran the same exercise to see if Claude agreed with my accountant. An automated second opinion, so to speak. Spent about 6 minutes analyzing all the PDFs and basically nailed it perfectly in one shot.
My only point here is it sure seems the same activity / use case can have wildly different results across sessions or users. Customer support and product development in the age of non-deterministic software is a strange, strange beast.
What does nailing mean when you ask whether it agreed with your accountant?