Hacker News

coldtea · yesterday at 8:52 PM

>Something related, but different, happened with chardet. The current maintainer reimplemented it from scratch by only pointing it to the API and the test suite.

Only "pointing it". But the LLM, which can recite over 90% of a book in its training set verbatim *, would also have been trained on the original code.

Maybe "the slop of Theseus" is a better title.

* https://the-decoder.com/researchers-extract-up-to-96-of-harr...


Replies

logicprog · yesterday at 10:46 PM

Also from that exact same study (why not cite the actual study? It's quite readable): the LLMs couldn't recite more than a small fraction of many other books, often ones just as well known [0]. In fact, from the bar charts in the very news article you cited, it's pretty clear that Sonnet 3.7 was a massive outlier, and so was Harry Potter and the Sorcerer's Stone, so that pairing seems extremely unrepresentative. If all the other LLMs couldn't recite even a small fraction of the other books, despite those being widely reproduced classics, why would we expect LLMs to regularly regurgitate, especially a relatively unknown open source project that probably hasn't been separately reproduced that many times?

Not to mention the fact that, as the other commenters mention, that appears to just... not have happened at all in this case, so it's a moot point.

[0]: https://arxiv.org/pdf/2601.02671

the_mitsuhiko · yesterday at 9:06 PM

Maybe, but the LLM did not recite the chardet source code, so that argument does not appear to apply here.
