Hacker News

20k, today at 5:58 PM

Yes, LLMs fundamentally operate as a lossy compression scheme for their training data. There have been countless examples of them reproducing their training data with very high accuracy.

People claim that the data isn't stored, but clearly a representation of it is encoded and reproducible. I saw ChatGPT plagiarise a Stack Overflow comment word for word just two days ago.


Replies

nonethewiser, today at 6:16 PM

Does this actually imply that a representation of it has been stored, or simply that the model is overfit?
