So instead of scraping IA once, the AI companies will use residential proxies and each scrape the site themselves, costing the news sites even more money. The only real loser is the common man who doesn't have the resources to scrape the entire web himself.
I've sometimes dreamed of a web where every resource is tied to a hash, so it can be rehosted by third parties and archival becomes transparent. This would also make it trivial to stand up a small website without worrying about it getting hug-of-deathed, since others would rehost your content for you. Shame IPFS never went anywhere.
IPFS was an attempt at this: https://en.wikipedia.org/wiki/InterPlanetary_File_System
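The core trick is small enough to sketch: fetch the bytes from whoever happens to host them and trust the hash, not the host. A rough Python sketch of that idea (the mirror URLs are made up, and this is not how IPFS actually resolves content, just the general principle):

```python
import hashlib
import urllib.request

# Hypothetical mirror list -- any third party could rehost the bytes,
# because the hash (not the host) is what identifies the resource.
MIRRORS = [
    "https://mirror-a.example/objects/{digest}",
    "https://mirror-b.example/objects/{digest}",
]

def fetch_by_hash(digest: str) -> bytes:
    """Try each mirror; accept a response only if its SHA-256 matches."""
    for template in MIRRORS:
        url = template.format(digest=digest)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                data = resp.read()
        except OSError:
            continue  # mirror down or unreachable, try the next one
        if hashlib.sha256(data).hexdigest() == digest:
            return data  # content verified, so it doesn't matter who served it
    raise LookupError(f"no mirror returned valid content for {digest}")
```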
AI companies are _already_ funding and using residential proxies. Guess how many of those proxies are acquired by compromising machines or tricking people into installing apps?
They already are. I've been dealing with Vietnamese and Korean residential proxies hammering my systems for weeks, and I'm growing tired. I cannot survive 3,500 RPS 24/7.
I don’t believe resips will be with us for long, at least not to the extent they are now. There is pressure, and there are strong commercial interests against the whole thing. I think the problem will, in part, solve itself.
Also, I always wonder about Common Crawl:
Is there something wrong with it? Is it badly designed? What is it that all the trainers cannot find there, so that each of them needs to crawl our sites over and over again for the exact same stuff?
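For what it's worth, the data isn't hard to get at. A rough sketch of looking a URL up in the public CDX index at index.commoncrawl.org (the crawl ID here is just an example; pick any recent one from the index listing):

```python
import json
import urllib.parse
import urllib.request

# Which crawl to query -- an example; any recent CC-MAIN index listed
# at https://index.commoncrawl.org/ works the same way.
INDEX = "CC-MAIN-2024-33-index"

def lookup(url_pattern: str):
    """Query the Common Crawl CDX index for captures matching a URL pattern."""
    query = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
    endpoint = f"https://index.commoncrawl.org/{INDEX}?{query}"
    with urllib.request.urlopen(endpoint, timeout=30) as resp:
        # One JSON object per line: timestamp, status, and the WARC
        # filename/offset/length needed to fetch the archived page body.
        return [json.loads(line) for line in resp.read().splitlines() if line.strip()]

if __name__ == "__main__":
    for record in lookup("example.com/*")[:5]:
        print(record["timestamp"], record["status"], record["filename"])
```

Each match points at a WARC file, offset, and length on data.commoncrawl.org, so the archived page body can be pulled with a range request instead of hitting the origin site yet again.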
Even if the site is archived on IA, AI companies will still do the same.
AI browsers will be the scrapers, shipping content back to the mothership for processing and storage as users co-browse with the agentic browser.
The AI companies won't just scrape IA once; they keep coming back to the same pages and scraping them over and over, even if nothing has changed.
This is from my experience running a personal website: AI companies keep coming back even when everything is the same.