Interestingly, I looked at GitHub Insights and found that this repo had 49 clones from 28 unique cloners before I published this article. I definitely did not clone it 49 times, and certainly not from 28 different accounts. It's unlikely that the handful of friends who follow me on GitHub all cloned the repo, so I can only speculate that there are bots scraping new public GitHub repos and training on everything they find.
Maybe that's obvious to most people, but it was a bit surprising to see it myself. It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.
The article doesn't contain any LLM output. I use LLMs to ask for advice on coding conventions (especially in Rust, since I'm bad at it), and sometimes as part of research (zstd was suggested by ChatGPT, along with comparisons to similar algorithms).
I self-host Gitea. The instance is crawled by AI crawlers (I checked the IPs). They never clone; they just browse and take the content directly from there.
Time to start including deliberate bugs. The correct version is in a private repository.
This was happening before LLMs, too.
I don't really get why they need to clone in order to scrape ...?
> It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.
That's very much expected. That's why the quality of LLM coding agents is like it is. (No offense.)
The "asking LLMs for advice" part is where the circular aspect starts to come into the picture. Not worse than looking at StackOverflow though which then links to other people who in turn turned to StackOverflow for advice.
Particularly on GitHub, it might not even be LLM scrapers, just regular bots looking for committed secrets (AWS key pairs, passwords, etc.).