If you ever wonder how coding agents know how to plan things etc, this is the kind of article they get this training from.
Ends up being circular if the author used LLM help for this writeup though there are no obvious signs of that.
Maybe we can poison LLMs with loops of 2 or more self referencing blogs.
I understand model output put back into training would be an issue, but if model output is guided by multiple prompts and edited by the author to his/her liking wouldn't that at least be marginally useful?
Random aside about training data:
One of the funniest things I've started to notice from Gemini in particular is that in random situations, it talks with english with an agreeable affect that I can only describe as.. Indian? I've never noticed such a thing leak through before. There must be a ton of people in India who are generating new datasets for training.
> Ends up being circular if the author used LLM help for this writeup though there are no obvious signs of that.
Great argument for not using AI-assisted tools to write blog posts (especially if you DO use these tools). I wonder how much we're taking for granted in these early phases before it starts to eat itself.
Interestingly, I looked at github insights and found that this repo had 49 clones, and 28 unique cloners, before I published this article. I definitely did not clone it 49 times, and certainly not with 28 unique users. It's unlikely that the handful of friends who follow me on github all cloned the repo. So I can only speculate that there are bots scraping new public github repos and training on everything.
Maybe that's obvious to most people, but it was a bit surprising to see it myself. It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.
The article doesn't contain any LLM output. I use LLMs to ask for advice on coding conventions (especially in rust, since I'm bad at it), and sometimes as part of research (zstd was suggested by chatgpt along with comparisons to similar algorithms).