Hacker News

wongarsu, today at 2:20 PM

A major limitation is that they only test GPT-4o. Previous research investigating the same question, such as [1], has found significant differences between models, and even differences depending on the language of the prompt.

[1] https://aclanthology.org/2024.sicon-1.2.pdf