Hacker News

toast0 · yesterday at 7:50 PM

The siblings so far talk about the synchronizing nature of the indicator bits, but that's not relevant to your question. Your question is really:

Why is U+0080 encoded as c2 80, instead of c0 80, which is the lowest sequence after 7f?

I suspect the answer is

a) the security impacts of overlong encodings were not contemplated; lots of fun to be had if one component accepts overlong encodings while another scans for patterns using only their shortest encodings

b) utf-8 as standardized allows encode and decode with bitmask and bitshift only; your proposed encoding additionally requires addition and subtraction (quick sketch below)
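To make that concrete, here's a minimal sketch in C of the two-byte case. The function names and layout are mine, just to illustrate; the "offset" variant is the hypothetical scheme from your question, not anything standardized:

    #include <stdio.h>

    /* Standard utf-8, two-byte range (U+0080..U+07FF): bitmask/bitshift/or only. */
    static void utf8_encode2(unsigned cp, unsigned char out[2]) {
        out[0] = 0xC0 | (cp >> 6);      /* U+0080 -> C2 */
        out[1] = 0x80 | (cp & 0x3F);    /*        -> 80 */
    }

    /* Hypothetical offset variant: the two-byte range starts counting at
       U+0080, so encode needs a subtraction (and decode an addition) on
       top of the masking and shifting. */
    static void offset_encode2(unsigned cp, unsigned char out[2]) {
        unsigned v = cp - 0x80;         /* the "magic additive constant" */
        out[0] = 0xC0 | (v >> 6);       /* U+0080 -> C0 */
        out[1] = 0x80 | (v & 0x3F);     /*        -> 80 */
    }

    int main(void) {
        unsigned char a[2], b[2];
        utf8_encode2(0x80, a);
        offset_encode2(0x80, b);
        printf("standard: %02X %02X   offset: %02X %02X\n", a[0], a[1], b[0], b[1]);
        return 0;
    }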

You can find a bit of email discussion from 1992 here [1]; at the very bottom there are some notes about what became utf-8:

> 1. The 2 byte sequence has 2^11 codes, yet only 2^11-2^7 are allowed. The codes in the range 0-7f are illegal. I think this is preferable to a pile of magic additive constants for no real benefit. Similar comment applies to all of the longer sequences.

The FSS-UTF proposal included right before that note does use additive constants.

[1] https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt


Replies

hyperman1 · yesterday at 8:29 PM

Oops yeah. One of my bit sequences is of course wrong and seems to have derailed this discussion. Sorry for that. Your interpretation is correct.

I'd seen the first part of that mail, but your version is a lot longer. It's indeed quite convincing in declaring b) moot. And security wasn't as big a thing then as it is now, so you're probably right.

layer8 · yesterday at 8:59 PM

A variation of a) is comparing strings as UTF-8 byte sequences when overlong encodings are also accepted (before and/or after the comparison). This leads to situations where strings that test as unequal are actually equal in terms of code points.
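For example (my own toy strings; 0xC1 0x81 is the overlong two-byte form of U+0041, which a strict decoder must reject):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "A" in shortest form vs. the overlong two-byte form of the same
           code point (invalid utf-8, but some lenient decoders accept it). */
        unsigned char shortest[] = {0x41, 0x00};
        unsigned char overlong[] = {0xC1, 0x81, 0x00};

        /* Byte-wise comparison says the strings differ... */
        printf("byte-equal: %d\n", strcmp((char *)shortest, (char *)overlong) == 0);

        /* ...but a decoder that accepts overlong forms decodes both to the
           single code point U+0041, so they are equal in code points. */
        return 0;
    }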
