Hacker News

recursivecaveat yesterday at 5:16 AM

I doubt it's OpenAI. Maaaybe somebody who sells to OpenAI, but probably not. I think they're big enough to do this mostly in-house and properly. Before AI, only big players wanted a scrape of the entire internet, and they could write quality bots, cooperate, behave themselves, etc. Now every third-tier lab wants that data and a billion startups want to sell it, so it's a wild west of bad behavior and bad implementations. These scrapers use residential IP sets as well.


Replies

mikepavone yesterday at 11:53 PM

As someone with a self-hosted Mercurial instance dealing with this, I will say that the big names (OpenAI included, but not exclusively them) generally at least use proper user-agents and respect robots.txt, but they are still needlessly aggressive compared to traditional search indexers.

There are also scrapers that are hiding behind normal browser user agents. When I looked at IP ranges, at least some of them seemed to be coming from data centers in China.
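For anyone wanting to do the same kind of triage on their own logs, here is a minimal sketch of the check described above: does a request declare itself as a crawler, and does its source IP fall inside ranges the operator publishes? Everything specific here is an assumption for illustration. The `GPTBot` token is a placeholder for whatever the crawler's documentation says to match, `192.0.2.0/24` is a reserved documentation block rather than anyone's real range, and `classify` is a hypothetical helper name.

```python
import ipaddress

# Assumed values for illustration only. A real check would load the
# operator's published crawler ranges instead of this documentation block.
CLAIMED_BOT_TOKEN = "GPTBot"
PUBLISHED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def classify(user_agent: str, remote_ip: str) -> str:
    """Label a request by what it claims to be and where it comes from."""
    ip = ipaddress.ip_address(remote_ip)
    # Case-insensitive substring match on the user-agent header.
    claims_bot = CLAIMED_BOT_TOKEN.lower() in user_agent.lower()
    # True if the address sits inside any of the listed networks.
    in_range = any(ip in net for net in PUBLISHED_RANGES)
    if claims_bot and in_range:
        return "declared-and-verified"
    if claims_bot:
        return "declared-but-unverified"
    return "undeclared"
```

A request labeled "undeclared" is not necessarily a scraper, since ordinary browsers land there too. Separating those from bots hiding behind browser user agents needs other signals, such as request rate or which data center the IP belongs to.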

reppap yesterday at 6:55 PM

Stop just making up excuses for these companies. Other comments on this story have shown the bots are using OpenAI user agents and making requests from OpenAI-owned IP ranges.
