UTF-8 is indeed a genius design. But of course it’s crucially dependent on the decision for ASCII to use only 7 bits, which even in 1963 was kind of an odd choice.
Was this just historical luck? Is there a world where the designers of ASCII grabbed one more bit of code space for some nice-to-haves, or did they have code pages or other extensibility in mind from the start? I bet someone around here knows.
UTF-8 is as good a design as could be expected, but Unicode has scope-creep issues. What should be in Unicode?
Coming at it naively, people might think the scope is something like "all sufficiently widespread distinct, discrete glyphs used by humans for communication that can be printed". But that's not true, because
* It's not discrete. Some code points are for combining with other code points.
* It's not distinct. Some glyphs can be written in multiple ways. Some glyphs which (almost?) always display the same have different code points and meanings.
* It's not all printable. Control characters are in there - the ASCII ones pretty much had to be for compatibility, but Unicode has added plenty of its own.
I'm not aware of any Unicode code points that are animated - at least what's printable is printable on paper, not just on screen; there are no marquee or blink control characters, thank God. But who knows when that invariant will fall too.
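To make the "not discrete / not distinct" points concrete, here's a small Python sketch using the standard `unicodedata` module: 'é' can be one code point or an 'e' plus a combining accent, and the two forms compare unequal until normalized.

```python
import unicodedata

composed = "\u00E9"      # 'é' as a single code point (U+00E9)
decomposed = "e\u0301"   # 'e' + combining acute accent (U+0065 U+0301)

print(composed == decomposed)                                # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: same after normalization
```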
By the way, I know of one UTF encoding the author didn't mention: UTF-7. Like UTF-8, but assuming that the top (eighth) bit wasn't safe to use (apparently a sensible precaution over networks in the 80s). My boss managed to send me a mail encoded in UTF-7 once; that's how I know what it is. I don't know how he managed to send it, though.
For more on UTF-8's design, see Russ Cox's one-pager on it:
https://research.swtch.com/utf8
And Rob Pike's description of the history of how it was designed:
One thing I always wonder: it is possible to encode a Unicode codepoint with too many bytes. UTF-8 forbids these overlong forms; only the shortest one is valid. E.g. 00000001 is the same as 11000000 10000001.
So why not make the alternatives impossible by adding an offset, starting each longer form just past the last valid shorter option? Then 11000000 10000001 would give codepoint 128+1, since values 0 to 127 are already covered by a 1-byte sequence.
The advantages are clear: no illegal codes, and a slightly shorter string in edge cases. I presume the designers thought about this, so what were the disadvantages? Was the required addition an unacceptable hardware cost at the time?
UPDATE: Last bitsequence should of course be 10000001 and not 00000001. Sorry for that. Fixed it.
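For the curious, a minimal Python sketch of the proposed "biased" scheme next to what UTF-8 actually does for 2-byte sequences (the function names are just for illustration):

```python
def decode_biased_2byte(b0: int, b1: int) -> int:
    # Proposed scheme: take the payload bits and add 0x80, so the 2-byte
    # range starts right after the 1-byte range and no overlong form exists.
    payload = ((b0 & 0x1F) << 6) | (b1 & 0x3F)
    return 0x80 + payload

def decode_utf8_2byte(b0: int, b1: int) -> int:
    # Actual UTF-8: same payload bits, but overlong forms must be rejected.
    cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F)
    if cp < 0x80:
        raise ValueError("overlong encoding")
    return cp

print(decode_biased_2byte(0b11000000, 0b10000001))  # 129, as described above
print(decode_utf8_2byte(0b11000010, 0b10000001))    # 0x81, the shortest valid form
```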
I have a love-hate relationship with backwards compatibility. I hate the mess - I love when an entity in a position of power is willing to break things in the name of advancement. But I also love the cleverness - UTF-8, UTF-16, EAN, etc. To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.
UTF-8 is great and I wish everything used it (looking at you, JavaScript). But it does have a wart in that there are byte sequences which are invalid UTF-8, and how to interpret them is undefined. I think a perfect design would define exactly how to interpret every possible byte sequence, even if nominally "invalid". This is how the HTML5 spec works, and it's been phenomenally successful.
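As a quick Python illustration (not anything the HTML5 spec itself mandates byte-for-byte): the same invalid sequence can either be rejected outright or mapped to U+FFFD replacement characters, and which one you get depends on the decoder.

```python
bad = b"abc\xC0\xAFdef"  # 0xC0 0xAF is an overlong (invalid) encoding of '/'

try:
    bad.decode("utf-8")  # strict mode: raises on the invalid bytes
except UnicodeDecodeError as e:
    print("rejected:", e.reason)

print(bad.decode("utf-8", errors="replace"))  # invalid bytes become U+FFFD replacement characters
```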
Love the UTF-8 playground that's linked: https://utf8-playground.netlify.app/
Would be great if it was possible to enter codepoints directly; you can do it via the URL (e.g. `/F8FF`), but not in the UI. (Edit: the future is now. https://github.com/vishnuharidas/utf8-playground/pull/6)
It should be noted that the final design for UTF-8 was sketched out on a placemat by Rob Pike and Ken Thompson.
I remember a time before UTF-8's ubiquity. It was such a headache moving to i18n. I love UTF-8.
If you want to delve deeper into this topic and like the Advent of Code format, you're in luck: i18n-puzzles[1] has a bunch of puzzles related to text encoding that drill how UTF-8 (and other variants such as UTF-16) work into your brain.
Rob Pike and Ken Thompson are brilliant computer scientists & engineers.
I’ve re-read Joel’s article on Unicode so many times. It’s also very helpful.
https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
I once saw a good byte encoding for Unicode: 7 bits for data, 1 for continuation/stop. Three bytes give 21 bits of data, which is enough for the whole range. ASCII compatible, at most 3 bytes per character. Very simple: the description is sufficient to implement it.
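A rough Python sketch of that scheme as I read it (top bit 0 on the final byte so plain ASCII passes through unchanged, top bit 1 on continuation bytes); the names and details here are just one possible reading of the description:

```python
def encode_vlq(cp: int) -> bytes:
    out = [cp & 0x7F]  # final byte: top bit 0, same as ASCII for 0-127
    cp >>= 7
    while cp:
        out.append(0x80 | (cp & 0x7F))  # continuation byte: top bit 1
        cp >>= 7
    return bytes(reversed(out))

def decode_vlq(data: bytes) -> int:
    cp = 0
    for b in data:
        cp = (cp << 7) | (b & 0x7F)
    return cp

assert encode_vlq(ord("A")) == b"A"    # ASCII passes through unchanged
assert len(encode_vlq(0x10FFFF)) == 3  # whole Unicode range fits in 3 bytes
assert decode_vlq(encode_vlq(0x1F600)) == 0x1F600
```

Note that, as written, overlong forms are possible (0x80 0x41 also decodes to 'A') and a lead byte doesn't announce the sequence length up front, two things UTF-8's design deliberately rules out.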
I need to call out a myth about UTF-8. Tools built to assume UTF-8 are not backwards compatible with ASCII. An encoding INCLUDES but also EXCLUDES. When a tool is set to use UTF-8, it will process an ASCII stream, but it will not filter out non-ASCII.
I still use some tools that assume ASCII input. For many years now, Linux tools have been removing the ability to specify default ASCII, leaving UTF-8 as the only relevant choice. This has caused me extra work, because if the data processing chain goes through these tools, I have to manually inspect the data for non-ASCII noise that has been introduced. I mostly use those older tools on Windows now, because most Windows tools still allow you to set default ASCII.
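For what it's worth, the manual inspection step can be scripted; a quick Python sketch (the filename argument is just illustrative) that reports any byte outside the ASCII range:

```python
import sys

def find_non_ascii(path: str):
    # Return (offset, byte) pairs for every byte with the high bit set.
    with open(path, "rb") as f:
        data = f.read()
    return [(i, b) for i, b in enumerate(data) if b >= 0x80]

for offset, byte in find_non_ascii(sys.argv[1]):
    print(f"offset {offset}: 0x{byte:02X}")
```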
While the backward compatibility of utf-8 is nice, and makes adoption much easier, the backward compatibility does not come at any cost to the elegance of the encoding.
In other words, yes it's backward compatible, but UTF-8 is also compact and elegant even without that.
Nice article, thank you. I love UTF-8, but I only advocate it when used with a BOM. Otherwise, an application may have no way of knowing that it is UTF-8, and that it needs to be saved as UTF-8.
Imagine selecting New/Text Document in an environment like File Explorer on Windows: if the initial (empty) file has a BOM, any app will know that it is supposed to be saved again as UTF-8 once you start working on it. But with no BOM, there is no such luck, and corruption may be just around the corner, even when the editor tries to auto-detect the encoding (auto-detection is never easy or 100% reliable, even for basic Latin text with "special" characters).
The same can happen to a plain ASCII file (without a BOM): once you edit it, and you add, say, some accented vowel, the chaos begins. You thought it was Italian, but your favorite text editor might conclude it's Vietnamese! I've even seen Notepad switch to a different default encoding after some Windows updates.
So, UTF-8 yes, but with a BOM. It should be the default in any app and operating system.
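The check itself is tiny; a minimal Python sketch of how an editor might detect the UTF-8 BOM (the three bytes EF BB BF):

```python
UTF8_BOM = b"\xEF\xBB\xBF"

def has_utf8_bom(path: str) -> bool:
    # A file "declares" itself as UTF-8 if it starts with the BOM bytes.
    with open(path, "rb") as f:
        return f.read(3) == UTF8_BOM
```

Python exposes the same idea as the "utf-8-sig" codec, which strips the BOM on read and writes one on save.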
UTF-8 is a horrible design. The only reason it was widely adopted was backwards compatibility with ASCII. There are a large number of invalid byte combinations that have to be discarded. Parsing forward is complex even before taking invalid byte combinations into account, and parsing backwards is even worse. Compare that to UTF-16, where parsing forward and backwards is simpler than UTF-8, and if there is an invalid surrogate combination, one can assume it is a valid UCS-2 char.
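To illustrate the claim about backwards parsing in UTF-16 (a Python sketch over a list of 16-bit code units; the helper name is made up): a trailing surrogate is recognizable from a single range check, so stepping back over a character never looks at more than two units.

```python
def char_start_utf16(units: list[int], i: int) -> int:
    # 0xDC00-0xDFFF is the low (trailing) surrogate range: the second half
    # of a surrogate pair. Anything else is a complete code point on its own.
    if 0xDC00 <= units[i] <= 0xDFFF and i > 0:
        return i - 1
    return i

# U+1D11E (musical G clef) encodes as the surrogate pair 0xD834 0xDD1E.
units = [ord("a"), 0xD834, 0xDD1E, ord("b")]
print(char_start_utf16(units, 2))  # 1: index 2 is the trailing half of the pair
```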
A little off topic, but amidst a lot of discussion of UTF-8 and its ASCII compatibility property I'm going to mention my one gripe with ASCII, something I never see anyone talking about, something I've never talked about before: the damn 0x7F character. Such an annoying anomaly in every conceivable way. It would be much better if it were some other proper printable punctuation or punctuation-adjacent character. A copyright character. Or a pi character, or just about anything other than what it already is. I have been programming and studying packet dumps long enough that I can basically convert hex to ASCII and vice versa in my head, but I still recoil at this anomalous character (DELETE? is that what I should call it?) every time.
It took time for UTF-8 to make sense. Struggling with how large everything became was a real problem just after the turn of the century. Today it makes more sense because capacity and compute power are much greater, but back then it was a huge pain in the ass.
I read online that codepoints are formatted with at least 4 hex digits for historical reasons, so U+41 (Latin A) is written as U+0041.
I have always wondered - what if the utf-8 space is filled up? Does it automatically promote to having a 5th byte? Is that part of the spec? Or are we then talking about utf-16?
Meanwhile Shift-JIS has a bad design, since the second byte of a character can be any byte in the range 0x40-0x7E (among others), which overlaps printable ASCII. This includes brackets, backslash, caret, backquote, curly braces, pipe, and tilde. This can cause a path separator or math operator to appear in text that is encoded as Shift-JIS but interpreted as plain ASCII.
UTF-8 basically learned from the mistakes of previous encodings which allowed that kind of thing.
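A concrete Python example of the hazard described above: '表' (U+8868) is the classic case whose second Shift-JIS byte is 0x5C, the ASCII backslash.

```python
encoded = "表".encode("shift_jis")
print(encoded)          # b'\x95\\' -> the second byte is 0x5C
print(0x5C in encoded)  # True: an encoding-unaware tool may treat it as '\'
```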
UTF-8 is undeniably a good answer, but to a relatively simple bit-twiddling / variable-length integer encoding problem in a somewhat specific context.
I realize that hindsight is 20/20, and times were different, but let's face it: "how to use an unused top bit to best encode the larger numbers representing Unicode" is not that much of a challenge, and the space of practical solutions isn't even all that large.
Uvarint also has the property that a file containing only ASCII characters is still a valid ASCII file.
Even for varints (you could probably drop the intermediate prefixes for that). There are many examples of using SIMD to decode UTF-8, whereas the more common protobuf scheme is known to be hostile to SIMD and the branch predictor.
Well, yes, Ken Thompson, the father of Unix, is behind it.
UTF-8 contributors are some of our modern day unsung heroes. The design is brilliant but the dedication to encode every single way humans communicate via text into a single standard, and succeed at it, is truly on another level.
Most other standards just do the xkcd thing: "now there's 15 competing standards"
> Every ASCII encoded file is a valid UTF-8 file.
More importantly, that file has the same meaning. Same with the converse: a UTF-8 file that happens to contain only ASCII bytes is a valid ASCII file with the same meaning.
What are the perceived benefits of UTF-16 and UTF-32, and why did they come about?
I could ask Gemini but HN seems more knowledgeable.
It really is, in so many ways.
It is amazing how successful it's been.
UTF-8 is simply genius. It entirely obviated the need for clunky 2-byte encodings (and all the associated nonsense about byte order marks).
The only problem with UTF-8 is that Windows and Java were developed without knowledge of UTF-8 and ended up with 16-bit characters.
Oh yes, and Python 3 should have known better when it went through the string-bytes split.
I'll mention IPv6 as a bad design that could potentially have been a UTF-8-like success story.
UTF-8 should be a universal tokenizer
Great example of a technology you get from a brilliant guy with a vision and that you'll never get out of a committee.
Another collaboration by Pike and Thompson can be seen here: https://go.dev/.
UTF-8 was a huge improvement for sure, but 20-25 years ago I was working with Latin-1 (so 8-bit characters), which was a struggle in the years it took for everything to switch to UTF-8. The compatibility with ASCII meant you only really noticed something was wrong when the data had special characters not representable in ASCII but valid in Latin-1. So perhaps breaking backwards compatibility would've resulted in less data corruption overall.
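A small Python illustration of the kind of silent mismatch described above: Latin-1 bytes for a non-ASCII character are invalid UTF-8, and UTF-8 bytes read as Latin-1 turn into the familiar mojibake.

```python
latin1 = "café".encode("latin-1")  # b'caf\xe9'
try:
    latin1.decode("utf-8")         # 0xE9 is not valid on its own in UTF-8
except UnicodeDecodeError:
    print("not valid UTF-8")

utf8 = "café".encode("utf-8")      # b'caf\xc3\xa9'
print(utf8.decode("latin-1"))      # 'cafÃ©' -- classic mojibake
```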
I'm just gonna leave this here too: https://www.youtube.com/watch?v=MijmeoH9LT4
Looks similar to MIDI.
> Another one is the ISO/IEC 8859 encodings are single-byte encodings that extend ASCII to include additional characters, but they are limited to 256 characters.
ISO 2022 allowed you to use control codes to switch between ISO 8859 character sets though, allowing for mixed-script text streams.
How many llm tokens are wasted everyday resolving utf issues?
meh. it's a brilliant design to put a bandage over a bad design. if a language can't fit into 255 glyphs, it should be reinvented.
Now fix fonts! It should be possible to render any valid string in a font.
Having the continuation bytes always start with the bits `10` also makes it possible to seek to any random byte and trivially know if you're at the beginning of a character or at a continuation byte, like you mentioned, so you can easily find the beginning of the next or previous character.
If the characters were instead encoded like EBML's variable size integers[1] (but inverting 1 and 0 to keep ASCII compatibility for the single-byte case), and you do a random seek, it wouldn't be as easy (or maybe not even possible) to know if you landed on the beginning of a character or in one of the `xxxx xxxx` bytes.
[1]: https://www.rfc-editor.org/rfc/rfc8794#section-4.4
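A tiny Python sketch of that resynchronization trick (the function name is just illustrative): from any byte offset, skip backwards over continuation bytes (10xxxxxx) to land on the start of the current character.

```python
def char_start(data: bytes, i: int) -> int:
    while i > 0 and (data[i] & 0xC0) == 0x80:  # 0b10xxxxxx means continuation byte
        i -= 1
    return i

s = "héllo".encode("utf-8")  # b'h\xc3\xa9llo'
print(char_start(s, 2))      # 1: byte 2 is the continuation byte of 'é'
```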