
hyperman1 · yesterday at 7:16 PM

One thing I always wonder: it is possible to encode a Unicode codepoint with too many bytes. UTF-8 forbids these; only the shortest encoding is valid. E.g. 00000001 is the same as 11000000 10000001.

So why not make the alternatives impossible by adding an offset, so that each sequence length starts where the shorter ones leave off? Then 11000000 10000001 would give codepoint 128+1, as values 0 to 127 are already covered by the 1-byte sequences.

The advantages are clear: no illegal codes, and slightly shorter strings in edge cases. I presume the designers thought about this, so what were the disadvantages? Was the required addition an unacceptable hardware cost at the time?

UPDATE: The last bit sequence should of course be 10000001 and not 00000001. Sorry for that. Fixed it.
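In code, the difference I'm imagining is roughly this (a sketch of just the 2-byte case, in C; the function names are made up):

    #include <stdint.h>

    /* Standard UTF-8, 2-byte sequence: mask and shift only.
       110xxxxx 10yyyyyy -> codepoint xxxxxyyyyyy. Note that
       0x00..0x7F is also expressible this way, i.e. the
       overlong forms. */
    uint32_t decode2_standard(uint8_t b0, uint8_t b1) {
        return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
    }

    /* The variant I'm asking about: add 0x80 so the 2-byte range
       starts right after the 1-byte range. Overlong forms become
       unrepresentable, at the cost of an add here (and a subtract
       when encoding). */
    uint32_t decode2_biased(uint8_t b0, uint8_t b1) {
        return decode2_standard(b0, b1) + 0x80;
    }

Under the biased scheme, 11000000 10000001 (c0 81) decodes to 0x81, and the 2-byte form covers 0x80..0x87F rather than 0x80..0x7FF.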


Replies

toast0 · yesterday at 7:50 PM

The siblings so far talk about the synchronizing nature of the indicators, but that's not relevant to your question. Your question is more of

Why is U+0080 encoded as c2 80, instead of c0 80, which is the lowest sequence after 7f?

I suspect the answer is

a) the security impact of overlong encodings was not contemplated; there's lots of fun to be had if something accepts overlong encodings but scans for patterns using only the shortest encodings (see the sketch after point b)

b) utf-8 as standardized allows encode and decode with bitmask and bitshift only; your proposed encoding needs addition and subtraction on top of the bitmask and bitshift
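To make (a) concrete: a lenient decoder that skips the shortest-form check turns the overlong pair c0 af into '/', which a byte-level scan for 0x2f never saw. A sketch:

    #include <stdint.h>
    #include <stdio.h>

    /* Lenient 2-byte decode: mask and shift, no shortest-form check. */
    static uint32_t decode2(uint8_t b0, uint8_t b1) {
        return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
    }

    int main(void) {
        uint32_t cp = decode2(0xC0, 0xAF);      /* overlong encoding of '/' */
        printf("U+%04X (%c)\n", cp, (char)cp);  /* prints U+002F (/) */

        /* The check UTF-8 mandates: a 2-byte sequence must decode to
           at least 0x80, otherwise it's overlong and must be rejected. */
        if (cp < 0x80)
            puts("reject: overlong");
        return 0;
    }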

You can find a bit of email discussion from 1992 here [1] ... at the very bottom there are some notes about what became utf-8:

> 1. The 2 byte sequence has 2^11 codes, yet only 2^11-2^7 are allowed. The codes in the range 0-7f are illegal. I think this is preferable to a pile of magic additive constants for no real benefit. Similar comment applies to all of the longer sequences.

The FSS-UTF proposal quoted right before that note does use additive constants.

[1] https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

nostrademons · yesterday at 7:29 PM

I assume you mean "11000000 10000001" to preserve the property that all continuation bytes start with "10"? [Edit: looks like you edited that in.] Without that property, UTF-8 loses self-synchronization: the property that, given a truncated UTF-8 stream, you can always find the codepoint boundaries, and you lose at most one codepoint's worth of data rather than having the whole stream garbled.

In theory you could do it that way, but it comes at the cost of decoder performance. With UTF-8, you can reassemble a codepoint from a stream using only fast bitwise operations (&, |, and <<). If you had to subtract off the legal codepoints covered by the shorter sequences, you'd introduce additional arithmetic operations into both encoding and decoding.
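Concretely, self-synchronization makes recovering a boundary after a jump to an arbitrary byte offset a two-line loop (a sketch; utf8_resync is a made-up name):

    #include <stdint.h>

    /* Advance to the next codepoint boundary. Continuation bytes are
       exactly those matching 10xxxxxx; no lead byte ever does. */
    const uint8_t *utf8_resync(const uint8_t *p, const uint8_t *end) {
        while (p < end && (*p & 0xC0) == 0x80)
            p++;
        return p;
    }

A sequence carries at most 3 continuation bytes, so this skips at most 3 bytes: one partial codepoint lost, nothing else garbled.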

rhet0rica · yesterday at 7:25 PM

See quectophoton's comment: the requirement that continuation bytes are always tagged with a leading 10 is useful if a parser jumps in at a random offset, or, more commonly, if the text stream gets fragmented. This was actually a major concern when UTF-8 was devised in the early '90s, as transmission was much less reliable than it is today.

gpvos · yesterday at 10:12 PM

That would make the calculations more complicated and a little slower. Now you can do a few quick bit shifts. This was more of an issue back in the '90s when UTF-8 was designed and computers were slower.

umanwizard · yesterday at 7:25 PM

Because then it would be impossible to tell from looking at a byte whether it is the beginning of a character or not, which is a useful property of UTF-8.
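The high bits of any single byte tell you its role (a sketch):

    #include <stdint.h>

    /* Classify a UTF-8 byte from its leading bits alone. */
    const char *utf8_byte_role(uint8_t b) {
        if ((b & 0x80) == 0x00) return "1-byte codepoint (ASCII)";
        if ((b & 0xC0) == 0x80) return "continuation byte";
        if ((b & 0xE0) == 0xC0) return "lead byte, 2-byte sequence";
        if ((b & 0xF0) == 0xE0) return "lead byte, 3-byte sequence";
        if ((b & 0xF8) == 0xF0) return "lead byte, 4-byte sequence";
        return "invalid in UTF-8";
    }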

rightbyte · yesterday at 7:26 PM

I think that would garble random access?