Hacker News

everybodyknows yesterday at 8:57 PM

Following your link above, https://openai.com/gptbot

> ChatGPT-User is not used for crawling the web in an automatic fashion. Because these actions are initiated by a user, robots.txt rules may not apply.

So, not AI training in this case, nor any other large-batch scraping, but rather inference-time Retrieval Augmented Generation, with the "retrieval" happening over the web?


Replies

Tharre today at 2:39 AM

Those would have the user agent "ChatGPT-User" though, and I barely see those. The majority comes from "GPTBot" like in my excerpt above, which makes it pretty clear that it's used for some sort of training:

"GPTBot is used to make our generative AI foundation models more useful and safe. It is used to crawl content that may be used in training our generative AI foundation models. Disallowing GPTBot indicates a site’s content should not be used in training generative AI foundation models."

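For anyone who wants to see what such a disallow rule says per user agent, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `example.com` URL are made up for illustration, and `allowed` is a hypothetical helper. It only shows how the file is interpreted, not whether a given crawler actually honors it; per the quote upthread, robots.txt rules may not apply to ChatGPT-User at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that opts out of GPTBot only (an assumption,
# not taken from any real site).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# GPTBot matches its own group; ChatGPT-User falls through to the "*" group.
gptbot_ok = allowed("GPTBot", "https://example.com/page")
chatgpt_user_ok = allowed("ChatGPT-User", "https://example.com/page")
print("GPTBot allowed:", gptbot_ok)
print("ChatGPT-User allowed:", chatgpt_user_ok)
```

With this file, the GPTBot group blocks everything while other agents fall back to the wildcard group, so a site that wants to restrict both would need a separate group for each token.
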
groby_byesterday at 9:22 PM

Likely, at least for some. I've caught various chatbots/CLI harnesses more than once inspecting a GitHub repo file by file (often multiple times, because of context rot).

But the sheer volume makes it unlikely that's the only reason. It's not like everybody constantly has questions about the same tiny website.
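
One rough way to check the volume split on your own server is to tally requests by user agent token. This is a sketch only: the sample user agent strings below are invented for illustration (real ones will look different), and `tally` is a hypothetical helper that does plain substring matching on the two tokens named in this thread.

```python
from collections import Counter

# Tokens discussed in the thread; anything else is lumped into "other".
TOKENS = ("GPTBot", "ChatGPT-User")


def tally(user_agents):
    """Count requests per known token, using simple substring matching."""
    counts = Counter()
    for ua in user_agents:
        label = next((t for t in TOKENS if t in ua), "other")
        counts[label] += 1
    return counts


# Invented sample data; in practice these would be the user agent
# fields extracted from your access log.
sample = [
    "ExampleBrowser/1.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "ExampleBrowser/1.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "ExampleBrowser/1.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "ExampleBrowser/1.0 (compatible; ChatGPT-User/1.0)",
    "ExampleBrowser/1.0 (SomeOS 1.0)",
]

counts = tally(sample)
for label, n in counts.most_common():
    print(label, n)
```

User agent strings are trivially spoofed, so a tally like this shows only what clients claim to be.
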