Hacker News

ninjagoo · yesterday at 6:46 PM · 2 replies

Publishers like The Guardian and the NYT are blocking the IA/Wayback Machine, and 20% of news websites now block both the IA and Common Crawl. As an example, https://www.realtor.com/news/celebrity-real-estate/james-van... is unarchivable: the IA's crawler gets 429 responses even though the site is otherwise accessible.
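
A quick way to check this kind of asymmetry from your own machine is the Wayback Machine's public availability API. The sketch below (TypeScript, assuming Node 18+ for global fetch, with a placeholder article URL rather than the truncated link above) reports whether a snapshot exists and what status the live site returns to an ordinary client; it can't reproduce the 429 served to the archiver itself.

    // Sketch only: query the Wayback Machine's public availability API for a
    // URL, then fetch the live page as an ordinary client. The article URL
    // below is a placeholder, not the truncated realtor.com link above.
    async function checkArchivability(url: string): Promise<void> {
      // Does the Wayback Machine hold a snapshot of this URL?
      const api = `https://archive.org/wayback/available?url=${encodeURIComponent(url)}`;
      const availability = await (await fetch(api)).json();
      const snapshot = availability?.archived_snapshots?.closest;
      console.log(snapshot?.available
        ? `closest snapshot: ${snapshot.url} (${snapshot.timestamp})`
        : "no snapshot available");

      // How does the live site answer a normal request? (The 429s described
      // above are served to the archiver's crawler, not to ordinary clients.)
      const live = await fetch(url, { redirect: "follow" });
      console.log(`live site responded with HTTP ${live.status}`);
    }

    checkArchivability("https://example.com/some-news-article");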


Replies

trollbridge · yesterday at 7:45 PM

And whilst the IA will honour requests not to archive/index, more aggressive scrapers won't, and will disguise their traffic as normal human browser traffic.

So we've basically decided we only want bad actors to be able to scrape, archive, and index.

fc417fc802 · yesterday at 7:47 PM

Presumably someone has already built this and I'm just unaware of it, but I've long thought some sort of crowd-sourced archival effort via browser extension should exist. I'm not sure how such an extension would avoid archiving privileged data, though.
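
Something along those lines could be sketched against the WebExtensions webRequest API. In the TypeScript below, COLLECTOR_URL and the looksPublic() heuristics are made up for illustration; a real design would need consent and far stronger privacy filtering before uploading anything beyond a URL.

    // Sketch only: a WebExtension background script (manifest v3, with the
    // "webRequest" permission and host permissions) that reports publicly
    // reachable page URLs to a hypothetical collector. COLLECTOR_URL and the
    // looksPublic() heuristics are illustrative, not a real service or a
    // solved privacy model.
    const COLLECTOR_URL = "https://archive.example.org/submit"; // hypothetical

    // Rough heuristic: skip responses the server marked private or noindex.
    function looksPublic(headers: Record<string, string>): boolean {
      const cacheControl = (headers["cache-control"] ?? "").toLowerCase();
      const robots = (headers["x-robots-tag"] ?? "").toLowerCase();
      return !cacheControl.includes("private") && !robots.includes("noindex");
    }

    chrome.webRequest.onCompleted.addListener(
      (details) => {
        // Only top-level documents that loaded successfully.
        if (details.type !== "main_frame" || details.statusCode !== 200) return;

        const headers: Record<string, string> = {};
        for (const h of details.responseHeaders ?? []) {
          headers[h.name.toLowerCase()] = h.value ?? "";
        }
        if (!looksPublic(headers)) return;

        // Submit only the URL; uploading rendered page content would need
        // consent and much stronger filtering of logged-in/personalised pages.
        void fetch(COLLECTOR_URL, {
          method: "POST",
          headers: { "content-type": "application/json" },
          body: JSON.stringify({ url: details.url, seenAt: Date.now() }),
        });
      },
      { urls: ["http://*/*", "https://*/*"] },
      ["responseHeaders"],
    );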
