Which "tests", exactly? Do tell. Tests where LLMs don't beat a human baseline are genuinely hard to come by nowadays.