LLM pre-training models risk being unable to be updated with data from after 2025, as much of it is ...

reconnecting • yesterday at 7:52 PM • 3 replies

LLM pre-training models risk being unable to be updated with data from after 2025, as much of it is corrupted with LLM-generated content. We might be locked into outdated knowledge, where only whitelisted sources decide what to include.

Taking into account the sometimes blind belief that 'LLMs know everything', the outcome could be very costly, especially for technologies and businesses unfortunate enough to emerge after 2025.

Replies

neksn • yesterday at 9:14 PM

Considering all models can use search engines, is this really relevant?

agnosticmantis • today at 3:24 AM

It may not be mainly or solely due to LLM pollution, but rather the fact that every publisher, (social) media company, newspaper, etc. clammed up and started charging (licensing) fees sometime in the last couple of years.

So maybe there's just not much openly available and new content worth training on that wasn't available prior to 2025.

Pikamander2 • yesterday at 10:12 PM

But ChatGPT has been popular since early 2023, and even before it there was no shortage of low-quality content on the web.

If anything, this model being trained up to 2025 is a positive sign that the "circular LLM training" problem hasn't (yet) become unmanageable.

The year-long delay is probably just due to how long it takes to test/refine a cutting-edge model. It's surely possible to train one faster, but Google wouldn't want to release a new model unless it's going to top the usual benchmarks.
