There are so many better data sources that AI labs can use here that this argument really holds no water at all.
Peer reviewed journals, textbooks, in-house teams of experts, trusted news publications, etc.
The whole idea of scraping large swaths of the internet for training data has always been pretty dubious due to the variable data quality.
I mean, just look at the early Google models that told people to put glue on their pizza because of a joke in the training set. Garbage in, garbage out.
This is one of the first and most obvious problems all of these labs have run into, and countermeasures are only going to improve.
But they don’t, generally. Which is why it is a great argument: it’s easy to falsify, and you can see that it is what is actually happening.
Also, those other sources are getting buried in AI slop too.