I wonder if we could design a programming language specifically for teaching CS, and have a way to hard-exclude it from all LLM output. Kinda like antivirus software has special strings that are not viruses but trigger detections for testing.
This would probably require cooperation during model training, but now that I think of it, is there adversarial research on LLMs? Can you design text data specifically to mess with LLM training? Like, what is the 1MB of text data that, if I insert it into the training set, harms LLM training performance the most?
> Can you design text data specifically to mess with LLM training?
Maybe text that costs a LOT of tokens. Very, very verbose. I think if there are rules and they're on the internet, LLMs can eventually figure them out, so you have to make it expensive.
Another way would be to go offline. Never write it down, only talk about it at least 50 meters away from your phone. Transmitted through memory and whisper.
LLMs train in some standardized ways to emit things like tool calls, right? If you make those tokens a fundamental part of your programming language, it's possible you'd run into tokenizer bugs that make LLMs much more annoying to use. Pure conjecture though.
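A toy sketch of that collision, assuming a tokenizer that splits out special control strings before byte-level encoding (the token names and behavior here are made up for illustration, not any specific model's tokenizer):

```python
import re

# Hypothetical control tokens reserved by the model (assumed names).
SPECIAL_TOKENS = {"<|endoftext|>": 0, "<|tool_call|>": 1}

def toy_encode(text):
    """Split text on special tokens first (as many tokenizers do),
    then fall back to byte-level ids for ordinary text."""
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    ids = []
    for chunk in re.split(pattern, text):
        if chunk in SPECIAL_TOKENS:
            # The whole string collapses into a single control id.
            ids.append(SPECIAL_TOKENS[chunk])
        else:
            ids.extend(256 + b for b in chunk.encode("utf-8"))
    return ids

# A hypothetical language where "<|tool_call|>" is an ordinary keyword:
src = 'fn main() { <|tool_call|> log("hi") }'
ids = toy_encode(src)

# The keyword vanished into control id 1 -- the model never sees it
# as normal program text, so source code would not round-trip cleanly.
print(1 in ids)
```

Any language whose keywords overlap these reserved strings would get silently rewritten at the tokenizer boundary, which is the kind of annoyance the comment above is guessing at.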
INTERCAL
The solution is rather simple: make all keywords in the language as offensive as possible, and require every file to start with a header comment containing instructions for building a homemade bomb.