> To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.
It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code points up to U+10FFFF, which takes 21 bits.
I hope we don’t regret this limitation some day. I’m not aware of any other material reason to disallow larger UTF-8 code units.
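For reference, here's a minimal sketch of the surrogate-pair arithmetic (the function names are just mine, for illustration) showing where that ceiling comes from:

```python
# Rough sketch, not a full UTF-16 codec: how a surrogate pair encodes a
# code point above U+FFFF, and why the ceiling works out to U+10FFFF.

def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into a high/low surrogate."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000              # 20 payload bits
    high = 0xD800 | (v >> 10)     # top 10 bits -> D800..DBFF
    low = 0xDC00 | (v & 0x3FF)    # bottom 10 bits -> DC00..DFFF
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# 10 + 10 payload bits plus the 0x10000 offset: the largest reachable
# code point is 0x10000 + 0xFFFFF = 0x10FFFF, i.e. a 21-bit value.
assert from_surrogate_pair(*to_surrogate_pair(0x10FFFF)) == 0x10FFFF
```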
> It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code points up to U+10FFFF, which takes 21 bits
Yes, it is 'truncated' to the "UTF-16 accessible range":
* https://datatracker.ietf.org/doc/html/rfc3629#section-3
* https://en.wikipedia.org/wiki/UTF-8#History
Thompson's original design could handle up to six octets for each letter/symbol, with 31 bits of space:
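Roughly, the pre-RFC 3629 byte layouts were as follows (the little encoder below is just my illustrative sketch of that scheme, not any real library's API):

```python
# Original (pre-RFC 3629) UTF-8 byte layouts, per RFC 2279:
#   0xxxxxxx                                                7 bits
#   110xxxxx 10xxxxxx                                      11 bits
#   1110xxxx 10xxxxxx 10xxxxxx                             16 bits
#   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx                    21 bits
#   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx           26 bits
#   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx  31 bits

def encode_fss_utf(cp: int) -> bytes:
    """Encode a code point using the original up-to-six-byte scheme."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return bytes([cp])
    # (sequence length, exclusive upper bound, leading-byte prefix)
    for nbytes, limit, prefix in [(2, 1 << 11, 0xC0), (3, 1 << 16, 0xE0),
                                  (4, 1 << 21, 0xF0), (5, 1 << 26, 0xF8),
                                  (6, 1 << 31, 0xFC)]:
        if cp < limit:
            out = []
            for _ in range(nbytes - 1):
                out.append(0x80 | (cp & 0x3F))  # continuation bytes, 6 bits each
                cp >>= 6
            out.append(prefix | cp)             # leading byte carries the rest
            return bytes(reversed(out))
    raise ValueError("code point exceeds 31 bits")

assert encode_fss_utf(0x20AC) == b"\xe2\x82\xac"  # matches modern UTF-8 in range
assert len(encode_fss_utf(0x7FFFFFFF)) == 6       # the old 31-bit maximum
```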
> It sacrifices the ability to encode more than 21 bits
No, UTF-8's design can encode code points of up to 31 bits. The limitation to 21 bits comes from UTF-16, and was then adopted for UTF-8 too. When UTF-16 dies we'll be able to extend UTF-8 (though compatibility with existing decoders will be a problem).
That limitation will be trivial to lift once UTF-16 compatibility can be disregarded. This won’t happen soon, of course, given JavaScript and Windows, but the situation might be different in a hundred or a thousand years. Until then, we still have a lot of unassigned code points.
In addition, it would be possible to nest another surrogate-character-like scheme into UTF-16 to support a larger character set.
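Purely as an illustration of that idea (every name and block location below is invented; nothing like this exists in Unicode), a second layer of surrogate-style pairs could map code points beyond U+10FFFF onto two in-range code points, the same way D800..DBFF / DC00..DFFF extend UCS-2 today:

```python
# Hypothetical sketch only: reserve two 1024-code-point blocks (locations
# made up here) and let a pair of them carry 20 extra payload bits.
HYPER_HIGH = 0x10F800  # invented block, not a real Unicode assignment
HYPER_LOW = 0x10FC00   # invented block, not a real Unicode assignment

def to_hyper_pair(cp: int) -> tuple[int, int]:
    """Map a hypothetical code point just past U+10FFFF onto two in-range ones."""
    v = cp - 0x110000
    assert 0 <= v < (1 << 20)
    return HYPER_HIGH + (v >> 10), HYPER_LOW + (v & 0x3FF)

def from_hyper_pair(high: int, low: int) -> int:
    return 0x110000 + ((high - HYPER_HIGH) << 10) + (low - HYPER_LOW)

assert from_hyper_pair(*to_hyper_pair(0x110000)) == 0x110000
```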
It's always dangerous to stick one's neck out and say "[this many bits] ought to be enough for anybody", but I think it's very unlikely we'll ever run out of UTF-8 sequences. UTF-8 (as currently restricted) can represent about 1.1 million code points, of which we've assigned about 160,000 actual characters, plus another ~140,000 in the Private Use Area, which won't expand. And that's after covering nearly all of the world's known writing systems: the last several Unicode updates have added a few thousand characters here and there for very obscure and/or ancient writing systems, but those won't go on forever (and things like emoji usually only add a handful of new code points per update, because most new emoji are built from existing code points plus combining characters).
If I had to guess, I'd say we'll run out of IPv6 addresses before we run out of unassigned UTF-8 sequences.
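A quick back-of-the-envelope check on those numbers (the assigned-character count is only a rough figure and moves with every Unicode release):

```python
# Rough arithmetic behind the comment above; the "assigned" figure is approximate.
total_code_points = 0x110000               # U+0000..U+10FFFF
surrogates = 0xE000 - 0xD800               # 2,048 code points UTF-8 may never encode
scalar_values = total_code_points - surrogates   # 1,112,064 encodable code points

assigned_approx = 160_000                  # rough count of assigned characters
private_use = 6_400 + 2 * 65_534           # BMP PUA plus planes 15 and 16

remaining = scalar_values - assigned_approx - private_use
print(scalar_values, remaining)            # ~1.11 million total, ~815,000 still free
```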
The limitation tomorrow will be today's implementations, sadly.
That isn't really a case of UTF-8 sacrificing anything to be compatible with UTF-16. It's Unicode, not UTF-8, that made the sacrifice: Unicode is limited to 21 bits due to UTF-16. The UTF-8 design trivially extends to 6-byte sequences encoding up to 31-bit values. But why would UTF-8, a Unicode character encoding, support code points which Unicode has promised will never and can never exist?