
DiogenesKynikos today at 6:38 AM

The one-shot performance of their recall attempts is much less impressive. The two best-performing models were only able to reproduce about 70% of a 1000-token string. That's still pretty good, but it's not as if they spit out the book verbatim.

In other words, if you give an LLM a short segment of a very well known book, it can guess a short continuation (several sentences) reasonably accurately, but the continuation will usually contain errors.
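For concreteness, here's a minimal Python sketch of one way a figure like "reproduced about 70% of a 1000-token string" might be scored. This is my assumption of the metric, not necessarily what the paper did, and the whitespace tokenizer is a crude stand-in for the model's own:

    import difflib

    def recall_score(reference: str, generated: str) -> float:
        # Crude whitespace tokenization; a real eval would use the
        # model's actual tokenizer.
        ref_tokens = reference.split()
        gen_tokens = generated.split()
        # Fraction of tokens that line up between the two sequences.
        return difflib.SequenceMatcher(None, ref_tokens, gen_tokens).ratio()

    reference = "It was the best of times, it was the worst of times"
    generated = "It was the best of times, it was the worst of crimes"
    print(f"recall ~= {recall_score(reference, generated):.2f}")  # ~0.92

Under a metric like this, a single wrong token in an otherwise verbatim continuation still scores high, which is consistent with "pretty good, but not verbatim."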


Replies

D-Machine today at 6:54 AM

Right, and this should be contextualized with respect to code generation. It is not crazy to presume that LLMs have effectively memorized certain training sources nearly perfectly, but the ability to generate or extract outputs nearly identical to those sources will necessarily be highly contingent on the prompts used and their complexity.

So dismissals of "it was just translating C compilers in the training set to Rust" need to be carefully quantified, but they also need to be evaluated in the context of the prompts. As others in this post have noted, there are basically no details about the prompts.
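To make the prompt-contingency point concrete, here's a toy sketch. Everything in it is assumption: fake_model is a stand-in for an LLM completion call, not the actual study's setup, and it simulates a model that has memorized SOURCE but only continues it reliably from a long enough verbatim prefix:

    import difflib

    SOURCE = ("Call me Ishmael. Some years ago, never mind how long "
              "precisely, having little or no money in my purse, and "
              "nothing particular to interest me on shore, I thought I "
              "would sail about a little and see the watery part of the "
              "world.").split()

    def fake_model(prompt_tokens):
        # Hypothetical stand-in for an LLM call: continues SOURCE only
        # when given a sufficiently long verbatim prefix of it.
        n = len(prompt_tokens)
        if n >= 8 and prompt_tokens == SOURCE[:n]:
            return SOURCE[n:]
        return ["..."]  # off-distribution prompt -> no recall

    def extraction_rate(prefix_len):
        prompt, target = SOURCE[:prefix_len], SOURCE[prefix_len:]
        return difflib.SequenceMatcher(None, target,
                                       fake_model(prompt)).ratio()

    for n in (4, 8, 16):
        print(f"prefix={n:2d} tokens -> recall {extraction_rate(n):.2f}")

The measured "memorization" jumps from 0.00 to 1.00 purely as a function of prompt length here, which is why extraction numbers mean little without knowing how the model was prompted.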