I haven't been following it closely, but isn't part of the NYT lawsuit against OpenAI that it sometimes spits out NYT articles verbatim?
Genome analysis is also a lossy process that chops the data up into tiny bits, like a newspaper sent through a shredder. We then piece together matching sequences in a sort of puzzle. The result is often a relatively inaccurate solution. Then we try to do that again with a different copy of the newspaper sent through a different shredder. And again. A genome might be sequenced at 10x, 30x, or 100x coverage, meaning each position is covered by that many reads on average, with more overlapping reads representing higher confidence.
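To make the shredder analogy concrete, here is a toy sketch of what the 10x/30x/100x numbers count. It assumes a random 1,000-letter "genome" and fixed 50-letter reads, and it cheats by remembering where each read came from (a real sequencer does not know that, which is the whole puzzle). The names `shred` and `coverage` are made up for illustration.

```python
import random


def shred(text, read_len, n_reads, rng):
    """Sample n_reads random fixed-length substrings ("reads") with their start offsets."""
    starts = [rng.randrange(len(text) - read_len + 1) for _ in range(n_reads)]
    return [(s, text[s:s + read_len]) for s in starts]


def coverage(text_len, reads):
    """Count how many reads overlap each position of the original text."""
    depth = [0] * text_len
    for start, read in reads:
        for i in range(start, start + len(read)):
            depth[i] += 1
    return depth


rng = random.Random(0)  # fixed seed so the run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(1000))

# 600 reads of 50 letters over 1,000 positions: 600 * 50 / 1000 = 30x on average.
reads = shred(genome, read_len=50, n_reads=600, rng=rng)
depth = coverage(len(genome), reads)

mean_depth = sum(depth) / len(depth)
print(f"mean coverage: {mean_depth:.1f}x")
print(f"least-covered position: {min(depth)}x")
```

The mean works out to 30x by construction, but individual positions vary, and the ends of the sequence tend to be covered less. That unevenness is one reason deeper coverage buys more confidence.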
There might be ten million people who have quoted Harry Potter at some point in their blogs or forum posts. There are only so many words in the books.
Study: Meta AI model can reproduce almost half of Harry Potter book
https://arstechnica.com/features/2025/06/study-metas-llama-3...
See also GEMA vs. OpenAI.
That issue is different. When web tools were added to GPT-4o, it would fetch the site and basically copy-paste the text into the answer body, so you were able to read the content of the site without the site getting the ad impressions. Now the system prompts put a very tight word limit (25?) on quotes from sites the model visits.