Hacker News

kzrdude, yesterday at 9:40 PM

UTF-8 is just an encoding of Unicode. It is specified so that it can encode every Unicode code point up to U+10FFFF, and it doesn't extend further. UTF-16 encodes Unicode in a similar way and covers the same range; it doesn't encode anything more.
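To make that limit concrete, here's a minimal Python sketch (the names `MAX_CODE_POINT` and `is_valid_code_point` are just mine for illustration). It encodes the top code point in both UTF-8 and UTF-16, and checks that Python's `chr()` refuses anything one past it.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point Unicode defines

# The top code point is encodable in both UTF-8 and UTF-16.
top = chr(MAX_CODE_POINT)
top_utf8 = top.encode("utf-8")       # four bytes: f4 8f bf bf
top_utf16 = top.encode("utf-16-be")  # surrogate pair: dbff dfff


def is_valid_code_point(value):
    """Return True if Python accepts `value` as a Unicode code point."""
    try:
        chr(value)
        return True
    except ValueError:
        # chr() raises ValueError outside range(0x110000)
        return False


print(top_utf8.hex(), top_utf16.hex())
print(is_valid_code_point(0x10FFFF), is_valid_code_point(0x110000))
```

So the ceiling is a property of the code space itself, not of one particular encoding.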

So what would need to happen first is that Unicode decides to include larger code points. Then UTF-8 would need to be extended to handle encoding them. (But I don't think that will happen.)

It seems that less than 30% of the Unicode code point space is allocated, so roughly 70% is still free.

---

Think of these as three separate concepts to make it clear. We are effectively dealing with two translations: first from the abstract symbol to its defined Unicode code point, then from that code point to bytes using UTF-8.

1. The glyph or symbol ("A")

2. The Unicode code point for the symbol (U+0041 Latin Capital Letter A)

3. The UTF-8 encoding of the code point, as bytes (0x41)
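
The three steps above can be sketched in Python like this. The helper name `describe` is hypothetical, and it assumes a single-character string that has a name in the `unicodedata` database.

```python
import unicodedata


def describe(symbol):
    """Map one symbol to (code point, Unicode name, UTF-8 bytes as hex)."""
    code_point = ord(symbol)          # step 2: symbol -> code point
    name = unicodedata.name(symbol)   # official character name
    encoded = symbol.encode("utf-8")  # step 3: code point -> bytes
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    return f"U+{code_point:04X}", name, hex_bytes


for s in ["A", "é", "€"]:
    print(s, describe(s))
```

For "A" the code point and the byte happen to have the same value (0x41), which is why the two are easy to conflate. For "é" or "€" the code point and the bytes clearly differ.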


Replies

duskwuff, yesterday at 11:44 PM

As an aside: UTF-8, as earlier specified in RFC 2279, could encode code points up to U+7FFFFFFF (using sequences of up to six bytes). It was later restricted to U+10FFFF, in RFC 3629, to ensure compatibility with UTF-16.
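
For the curious, here's a rough sketch of what that older, longer form looks like. `encode_rfc2279` is a hypothetical helper I wrote from the bit layout in the RFC, not anything from a library, and it doesn't check for surrogates.

```python
def encode_rfc2279(cp):
    """Encode a value using the older 1-to-6-byte UTF-8 layout (sketch)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit range of RFC 2279")
    if cp < 0x80:
        return bytes([cp])  # ASCII is a single byte
    # (exclusive upper limit, sequence length, lead-byte prefix)
    layouts = [
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ]
    for limit, length, prefix in layouts:
        if cp < limit:
            break
    out = []
    for _ in range(length - 1):
        out.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    out.append(prefix | cp)  # lead byte carries the remaining high bits
    return bytes(reversed(out))


print(encode_rfc2279(0x10FFFF).hex())    # same bytes as modern UTF-8
print(encode_rfc2279(0x7FFFFFFF).hex())  # six-byte form, no longer valid
```

Up to U+10FFFF (surrogates aside) this produces the same bytes as today's UTF-8. Beyond that, a conforming modern decoder such as Python's rejects the result.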