s1mplicissimus • yesterday at 9:09 PM • 3 replies

I was just wondering a similar thing: If 10 implies start of character, doesn't that require 10 to never occur inside the other bits of a character?

Replies

gavinsyancey • yesterday at 9:57 PM

Generally you can assume byte-aligned access. So every byte of UTF-8 either starts with 0 or 11 to indicate an initial byte, or 10 to indicate a continuation byte.

pklausler • yesterday at 10:16 PM

10 never implies the start of a character; those begin with 0 or 11.

dbaupp • yesterday at 9:52 PM

UTF-8 encodes each character into a whole number of bytes (8, 16, 24, or 32 bits), and the 10 continuation marker only counts at the start of a continuation byte; when that bit pattern occurs elsewhere within a byte, it is just data.

You are correct that it never occurs at the start of a byte that isn’t a continuation byte: the first byte of each encoded code point starts with either 0 (ASCII code points) or 11 (non-ASCII).
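
To make the byte-aligned rule concrete, here is a rough Python sketch that looks only at the top bits of each byte. The helper names `classify` and `char_starts` are made up for illustration, and the code assumes the input is already valid UTF-8 (it does not reject bytes such as 0xF8 to 0xFF, which start with 11 but are never valid lead bytes).

```python
def classify(byte: int) -> str:
    """Classify one UTF-8 byte by its leading bits (assumes valid UTF-8)."""
    if (byte & 0b1000_0000) == 0:
        return "ascii"          # 0xxxxxxx: single-byte code point
    if (byte & 0b1100_0000) == 0b1000_0000:
        return "continuation"   # 10xxxxxx: continues a multi-byte code point
    return "lead"               # 11xxxxxx: first byte of a multi-byte code point


def char_starts(data: bytes) -> list[int]:
    """Return the byte offsets where an encoded code point begins."""
    return [i for i, b in enumerate(data) if classify(b) != "continuation"]


encoded = "aé€".encode("utf-8")        # bytes: 61 C3 A9 E2 82 AC
print([classify(b) for b in encoded])
print(char_starts(encoded))            # offsets 0, 1, and 3
```

Note that 0x61 ("a") is 01100001 in binary, so the bits 10 do appear inside it, but only the first two bits of each byte are inspected, so that does not matter.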
