Hacker News

oceansky, today at 4:26 PM

Out of curiosity, I wondered if you could break a tokenizer by introducing weird characters not mapped to an id.

But apparently, tokenizers either just emit a [UNK] token or fall back to encoding the unrecognized character as raw UTF-8 bytes.
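
To make the two behaviors concrete, here is a toy character-level sketch. It is not how any particular library implements this: the vocabulary, the ids, and the function names (`encode_with_unk`, `encode_with_byte_fallback`, `decode_with_byte_fallback`) are all made up for illustration, and real tokenizers work on subword pieces rather than single characters.

```python
# Hypothetical toy vocabulary; the ids are invented for this example.
VOCAB = {"[UNK]": 0, "h": 1, "i": 2, " ": 3}

# Assumption: 256 byte tokens are placed right after the regular vocabulary,
# so byte value b gets id BYTE_OFFSET + b.
BYTE_OFFSET = len(VOCAB)


def encode_with_unk(text):
    """Strategy 1: any character without an id collapses to [UNK]."""
    return [VOCAB.get(ch, VOCAB["[UNK]"]) for ch in text]


def encode_with_byte_fallback(text):
    """Strategy 2: any character without an id becomes its UTF-8 bytes."""
    ids = []
    for ch in text:
        if ch in VOCAB:
            ids.append(VOCAB[ch])
        else:
            # One token per byte of the character's UTF-8 encoding.
            ids.extend(BYTE_OFFSET + b for b in ch.encode("utf-8"))
    return ids


def decode_with_byte_fallback(ids):
    """Invert strategy 2 by collecting bytes and decoding them as UTF-8."""
    inverse = {i: tok for tok, i in VOCAB.items()}
    out = bytearray()
    for i in ids:
        if i >= BYTE_OFFSET:
            out.append(i - BYTE_OFFSET)
        else:
            out.extend(inverse[i].encode("utf-8"))
    return out.decode("utf-8")


if __name__ == "__main__":
    text = "hi \u2603"  # the snowman is not in VOCAB
    print(encode_with_unk(text))
    print(encode_with_byte_fallback(text))
    print(decode_with_byte_fallback(encode_with_byte_fallback(text)))
```

The difference that matters: with [UNK] the original character is gone for good, while the byte fallback costs a few extra tokens per character but round-trips back to the original string.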