kbenson • today at 5:53 PM

To keep with the analogy, isn't that sort of like testing two cars by having them both drive the same few-hundred-foot stretch of new road at the posted speed limit of 35 MPH? You will test some things doing that, but not particularly well, and hardly all the things people find interesting and useful for comparing the performance of cars.

To bring this back to the discussion at hand (and to be redundant, as it's been mentioned here already), there are many aspects of using an LLM that are not purely about the output from one or a few well-formed prompts. Additionally, if the end results are very similar, these other aspects will have an outsized influence on people's perspective of the tools, as they're the only differences that make it worth choosing one model over another.
