Out of curiosity, I wondered whether you could break a tokenizer by feeding it weird characters that aren't mapped to any id.
Apparently not: depending on the tokenizer, it either just emits an [UNK] token or falls back to encoding the unrecognized character as raw UTF-8 bytes.
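
To make the two behaviours concrete, here is a rough toy sketch in plain Python. It is not how any particular library does it: the four-entry `VOCAB`, the `BYTE_OFFSET` id layout, and all three function names are made up for illustration, and real tokenizers work on subword pieces rather than single characters.

```python
# Toy vocabulary (hypothetical): real vocabularies hold subword pieces.
VOCAB = {"[UNK]": 0, "h": 1, "i": 2, " ": 3}
ID_TO_CHAR = {i: ch for ch, i in VOCAB.items() if ch != "[UNK]"}

# Assumed layout: the 256 possible byte values get ids right after the vocab.
BYTE_OFFSET = len(VOCAB)


def encode_with_unk(text):
    """Strategy 1: any character without an id collapses to [UNK]."""
    return [VOCAB.get(ch, VOCAB["[UNK]"]) for ch in text]


def encode_with_byte_fallback(text):
    """Strategy 2: an unknown character becomes one token per UTF-8 byte."""
    ids = []
    for ch in text:
        if ch in VOCAB:
            ids.append(VOCAB[ch])
        else:
            ids.extend(BYTE_OFFSET + b for b in ch.encode("utf-8"))
    return ids


def decode_byte_fallback(ids):
    """Invert encode_with_byte_fallback by reassembling the raw bytes."""
    raw = bytearray()
    for i in ids:
        if i >= BYTE_OFFSET:
            raw.append(i - BYTE_OFFSET)
        else:
            raw.extend(ID_TO_CHAR[i].encode("utf-8"))
    return raw.decode("utf-8")


print(encode_with_unk("hi ☃"))            # [1, 2, 3, 0]
print(encode_with_byte_fallback("hi ☃"))  # [1, 2, 3, 230, 156, 135]
```

In this sketch the snowman costs one token with [UNK] but can't be recovered afterwards, while byte fallback spends three tokens (one per UTF-8 byte) and `decode_byte_fallback` gets the original string back.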