Hacker News

syncsynchalt, yesterday at 9:05 PM

UTF-16 has endian concerns and surrogates.

Both UTF-8 and UTF-16 have negatives but I don't think UTF-16 comes out ahead.


Replies

Mikhail_Edoshin, today at 4:12 AM

Here is what a UTF-8 decoder needs to handle:

1. Invalid bytes. Some bytes cannot appear in a UTF-8 string at all. There are two ranges of these: 0xC0-0xC1 and 0xF5-0xFF.

2. Conditionally invalid continuation bytes. In some states you read a continuation byte and extract the data, but in some other cases the valid range of the first continuation byte is further restricted.

3. Surrogates. They cannot appear in a valid UTF-8 string, so if they do, this is an error and you need to flag it as such. Or maybe process them as in CESU-8, but this means making sure they are correctly paired. Or maybe process them as in WTF-8: read them and let them through.

4. Form issues: an incomplete sequence or a continuation byte without a starting byte.
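To make the four cases concrete, here is a minimal sketch of a strict decoder in Python. The function name `decode_utf8` and the error messages are my own, not from any library, and it takes the strict option for surrogates (reject them) rather than the CESU-8 or WTF-8 behavior. Treat it as an illustration of where each case shows up, not a reference implementation.

```python
def decode_utf8(data: bytes) -> list[int]:
    """Strictly decode UTF-8 bytes into a list of code points (hypothetical helper)."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                       # ASCII, one byte
            out.append(b0)
            i += 1
            continue
        if b0 <= 0xBF:                      # case 4: continuation without a lead byte
            raise ValueError(f"stray continuation byte at offset {i}")
        if b0 in (0xC0, 0xC1) or b0 >= 0xF5:  # case 1: bytes that never appear
            raise ValueError(f"invalid byte at offset {i}")

        # Lead byte: number of continuation bytes, payload bits, and the
        # allowed range for the FIRST continuation byte.
        lo, hi = 0x80, 0xBF
        if b0 <= 0xDF:
            need, cp = 1, b0 & 0x1F
        elif b0 <= 0xEF:
            need, cp = 2, b0 & 0x0F
            if b0 == 0xE0:
                lo = 0xA0                   # case 2: would be an overlong form
            elif b0 == 0xED:
                hi = 0x9F                   # case 3: would be a surrogate
        else:
            need, cp = 3, b0 & 0x07
            if b0 == 0xF0:
                lo = 0x90                   # case 2: would be an overlong form
            elif b0 == 0xF4:
                hi = 0x8F                   # case 2: would exceed U+10FFFF

        for k in range(1, need + 1):
            if i + k >= n:                  # case 4: input ends mid-sequence
                raise ValueError(f"incomplete sequence at offset {i}")
            b = data[i + k]
            if not 0x80 <= b <= 0xBF:       # case 4: sequence cut short
                raise ValueError(f"incomplete sequence at offset {i}")
            if not lo <= b <= hi:           # cases 2 and 3
                raise ValueError(f"restricted continuation byte at offset {i + k}")
            cp = (cp << 6) | (b & 0x3F)
            lo, hi = 0x80, 0xBF             # only the first continuation is restricted
        out.append(cp)
        i += need + 1
    return out
```

Note that the surrogate check falls out of the same mechanism as the overlong checks: it is just one more restricted range on the first continuation byte.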

It is much more complicated than UTF-16. UTF-16 only has surrogates, which are pretty straightforward.
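For comparison, a sketch of the UTF-16 side under the same strict policy. The name `decode_utf16_units` is made up for illustration, and it assumes the input is already a sequence of 16-bit code units, so byte order has been dealt with before this point.

```python
def decode_utf16_units(units: list[int]) -> list[int]:
    """Strictly decode UTF-16 code units into code points (hypothetical helper)."""
    out = []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:           # high surrogate: must be followed by a low one
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
            raise ValueError(f"unpaired high surrogate at index {i}")
        if 0xDC00 <= u <= 0xDFFF:           # low surrogate with no high one before it
            raise ValueError(f"unpaired low surrogate at index {i}")
        out.append(u)                       # any other unit is the code point itself
        i += 1
    return out
```

The only error conditions here are the two unpaired-surrogate cases; there are no forbidden unit values and no restricted ranges to track.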
