Hacker News

ctippett today at 1:03 AM (6 replies)

Am I correct that this has come about because archive.org respects robots.txt and these sites have blocked their crawler from indexing their sites?

I'm not sure how to articulate my thoughts on this exactly, other than to say it's disappointing that doing the right thing (i.e. respecting robots.txt) is rewarded with the burden of soliciting responses to a petition while at the same time others are rewarded with profit for ignoring those same directives.


Replies

Paracompact today at 1:22 AM

Don't know if it helps your musings at all, but there's a good chance that if a high-profile crawler like archive.org disregarded their robots.txt, it would face lawsuits (or some other form of pressure). Respecting it is not merely the most moral move; it is the only sensible one.

The only reason "others are rewarded with profit" in cases like these is that pinkie-promise-style obligations don't affect players too small or shadowy to be worth litigating against.

cmeacham98 today at 1:20 AM

Correct. Example snippet from the nytimes.com robots.txt:

    User-agent: archive.org_bot
    Disallow: /
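
For anyone curious how a compliant crawler would read those two lines, here is a minimal sketch using Python's standard `urllib.robotparser`. It parses only the snippet quoted above, not the full live file, and the helper name `is_allowed` and the user agent `SomeOtherBot` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Only the two directives quoted above; the real file likely contains more rules.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler honoring robots_txt may fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("archive.org_bot", "SomeOtherBot"):
        verdict = is_allowed(ROBOTS_TXT, agent, "https://www.nytimes.com/")
        print(f"{agent}: {'allowed' if verdict else 'blocked'}")
```

Note that this only models what a crawler choosing to honor the file would do; nothing in robots.txt technically prevents a fetch.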
joecool1029 today at 2:29 AM

No, archive.org does NOT respect robots.txt. You need to reach out to them directly and ask that your site not be included: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

userbinator today at 3:43 AM

It's the same idiocy that DRM created.

Be a pirate, because a pirate is free...

Gigachad today at 1:38 AM

It's because they want to restrict AI companies from stealing content, but they can't do it if the Internet Archive proxies it all for them.

All of the LLMs would be massively less useful if it weren't for scraping the latest news.
