Hacker News

nijave, today at 12:35 PM

I'll respond with more anecdotal evidence: the Llama family has been terrible at following directions in all the tests I've done. Not sure about the other models in RULER.

In the Chroma results, they look at Sonnet 4, which was also terrible in my experience. The same prompt that worked perfectly in Sonnet 4.5 would fail miserably in Sonnet 4.

Would be good to see newer tests with both SOTA and open-weight models. The SOTA ones always seem to follow directions and stay on topic better, but it'd be good to have some data to back that up.