UTF-8 is indeed a genius design. But of course it’s crucially dependent on the decision for ASCII to use only 7 bits, which even in 1963 was kind of an odd choice.
Was this just historical luck? Is there a world where the designers of ASCII grabbed one more bit of code space for some nice-to-haves, or did they have code pages or other extensibility in mind from the start? I bet someone around here knows.
UTF-8 is as good a design as could be expected, but Unicode has scope-creep issues. What should be in Unicode?
Coming at it naively, people might think the scope is something like "all sufficiently widespread distinct, discrete glyphs used by humans for communication that can be printed". But that's not true, because
* It's not discrete. Some code points are for combining with other code points.
* It's not distinct. Some glyphs can be written in multiple ways. Some glyphs which (almost?) always display the same have different code points and meanings.
* It's not all printable. Control characters are in there - the ASCII ones pretty much had to be for compatibility, but Unicode has added plenty of its own.
I'm not aware of any Unicode code points that are animated - at least what's printable is printable on paper, not just on screen; there are no marquee or blink control characters, thank God. But who knows when that invariant will fall too.
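To make the "not discrete / not distinct" points concrete, here's a small Python sketch using the standard `unicodedata` module: 'é' can be one code point or an 'e' plus a combining accent, and the two forms compare unequal until normalized.

```python
import unicodedata

composed = "\u00E9"      # 'é' as a single code point (U+00E9)
decomposed = "e\u0301"   # 'e' + combining acute accent (U+0065 U+0301)

print(composed == decomposed)                                # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: same after normalization
```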
By the way, I know of one UTF encoding the author didn't mention: UTF-7. Like UTF-8, but assuming that the top (eighth) bit wasn't safe to use (apparently a sensible precaution over networks in the 80s). My boss managed to send me a mail encoded in UTF-7 once; that's how I know what it is. I don't know how he managed to send it, though.
For more on UTF-8's design, see Russ Cox's one-pager on it:
https://research.swtch.com/utf8
And Rob Pike's description of the history of how it was designed:
One thing I always wonder: it is possible to encode a Unicode codepoint with too many bytes. UTF-8 forbids these overlong forms; only the shortest one is valid. E.g. 00000001 is the same as 11000000 10000001.
So why not make the alternatives impossible by adding an offset, starting each longer form just past the last valid shorter option? Then 11000000 10000001 would give codepoint 128+1, since values 0 to 127 are already covered by a 1-byte sequence.
The advantages are clear: no illegal codes, and a slightly shorter string in edge cases. I presume the designers thought about this, so what were the disadvantages? Was the required addition an unacceptable hardware cost at the time?
UPDATE: Last bitsequence should of course be 10000001 and not 00000001. Sorry for that. Fixed it.
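For the curious, a minimal Python sketch of the proposed "biased" scheme next to what UTF-8 actually does for 2-byte sequences (the function names are just for illustration):

```python
def decode_biased_2byte(b0: int, b1: int) -> int:
    # Proposed scheme: take the payload bits and add 0x80, so the 2-byte
    # range starts right after the 1-byte range and no overlong form exists.
    payload = ((b0 & 0x1F) << 6) | (b1 & 0x3F)
    return 0x80 + payload

def decode_utf8_2byte(b0: int, b1: int) -> int:
    # Actual UTF-8: same payload bits, but overlong forms must be rejected.
    cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F)
    if cp < 0x80:
        raise ValueError("overlong encoding")
    return cp

print(decode_biased_2byte(0b11000000, 0b10000001))  # 129, as described above
print(decode_utf8_2byte(0b11000010, 0b10000001))    # 0x81, the shortest valid form
```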
I have a love-hate relationship with backwards compatibility. I hate the mess - I love when an entity in a position of power is willing to break things in the name of advancement. But I also love the cleverness - UTF-8, UTF-16, EAN, etc. To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.
UTF-8 is great and I wish everything used it (looking at you, JavaScript). But it does have a wart in that there are byte sequences which are invalid UTF-8, and how to interpret them is undefined. I think a perfect design would define exactly how to interpret every possible byte sequence, even if nominally "invalid". This is how the HTML5 spec works, and it's been phenomenally successful.
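As a quick Python illustration (not anything the HTML5 spec itself mandates byte-for-byte): the same invalid sequence can either be rejected outright or mapped to U+FFFD replacement characters, and which one you get depends on the decoder.

```python
bad = b"abc\xC0\xAFdef"  # 0xC0 0xAF is an overlong (invalid) encoding of '/'

try:
    bad.decode("utf-8")  # strict mode: raises on the invalid bytes
except UnicodeDecodeError as e:
    print("rejected:", e.reason)

print(bad.decode("utf-8", errors="replace"))  # invalid bytes become U+FFFD replacement characters
```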
Love the UTF-8 playground that's linked: https://utf8-playground.netlify.app/
Would be great if it was possible to enter codepoints directly; you can do it via the URL (e.g. `/F8FF`), but not in the UI. (Edit: the future is now. https://github.com/vishnuharidas/utf8-playground/pull/6)
It should be noted that the final design for UTF-8 was sketched out on a placemat by Rob Pike and Ken Thompson.
I remember a time before UTF-8's ubiquity. It was such a headache moving to i18n. I love UTF-8.
If you want to delve deeper into this topic and like the Advent of Code format, you're in luck: i18n-puzzles[1] has a bunch of puzzles related to text encoding that drill how UTF-8 (and other variants such as UTF-16) work into your brain.
Rob Pike and Ken Thompson are brilliant computer scientists & engineers.
I’ve re-read Joel’s article on Unicode so many times. It’s also very helpful.
https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
I once saw a good byte encoding for Unicode: 7 bits for data, 1 for continuation/stop. Three bytes give 21 bits of data, which is enough for the whole range. ASCII compatible, at most 3 bytes per character. Very simple: the description is sufficient to implement it.
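A rough Python sketch of that scheme as I read it (top bit 0 on the final byte so plain ASCII passes through unchanged, top bit 1 on continuation bytes); the names and details here are just one possible reading of the description:

```python
def encode_vlq(cp: int) -> bytes:
    out = [cp & 0x7F]  # final byte: top bit 0, same as ASCII for 0-127
    cp >>= 7
    while cp:
        out.append(0x80 | (cp & 0x7F))  # continuation byte: top bit 1
        cp >>= 7
    return bytes(reversed(out))

def decode_vlq(data: bytes) -> int:
    cp = 0
    for b in data:
        cp = (cp << 7) | (b & 0x7F)
    return cp

assert encode_vlq(ord("A")) == b"A"    # ASCII passes through unchanged
assert len(encode_vlq(0x10FFFF)) == 3  # whole Unicode range fits in 3 bytes
assert decode_vlq(encode_vlq(0x1F600)) == 0x1F600
```

Note that, as written, overlong forms are possible (0x80 0x41 also decodes to 'A') and a lead byte doesn't announce the sequence length up front, two things UTF-8's design deliberately rules out.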
I need to call out a myth about UTF-8. Tools built to assume UTF-8 are not backwards compatible with ASCII. An encoding INCLUDES but also EXCLUDES. When a tool is set to use UTF-8, it will process an ASCII stream, but it will not filter out non-ASCII.
I still use some tools that assume ASCII input. For many years now, Linux tools have been removing the ability to specify default ASCII, leaving UTF-8 as the only relevant choice. This has caused me extra work, because if the data processing chain goes through these tools, I have to manually inspect the data for non-ASCII noise that has been introduced. I mostly use those older tools on Windows now, because most Windows tools still allow you to set default ASCII.
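For what it's worth, the manual inspection step can be scripted; a quick Python sketch (the filename argument is just illustrative) that reports any byte outside the ASCII range:

```python
import sys

def find_non_ascii(path: str):
    # Return (offset, byte) pairs for every byte with the high bit set.
    with open(path, "rb") as f:
        data = f.read()
    return [(i, b) for i, b in enumerate(data) if b >= 0x80]

for offset, byte in find_non_ascii(sys.argv[1]):
    print(f"offset {offset}: 0x{byte:02X}")
```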
While the backward compatibility of utf-8 is nice, and makes adoption much easier, the backward compatibility does not come at any cost to the elegance of the encoding.
In other words, yes it's backward compatible, but UTF-8 is also compact and elegant even without that.
Nice article, thank you. I love UTF-8, but I only advocate it when used with a BOM. Otherwise, an application may have no way of knowing that it is UTF-8, and that it needs to be saved as UTF-8.
Imagine selecting New/Text Document in an environment like File Explorer on Windows: if the initial (empty) file has a BOM, any app will know that it is supposed to be saved again as UTF-8 once you start working on it. But with no BOM, there is no such luck, and corruption may be just around the corner, even when the editor tries to auto-detect the encoding (auto-detection is never easy or 100% reliable, even for basic Latin text with "special" characters).
The same can happen to a plain ASCII file (without a BOM): once you edit it, and you add, say, some accented vowel, the chaos begins. You thought it was Italian, but your favorite text editor might conclude it's Vietnamese! I've even seen Notepad switch to a different default encoding after some Windows updates.
So, UTF-8 yes, but with a BOM. It should be the default in any app and operating system.
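The check itself is tiny; a minimal Python sketch of how an editor might detect the UTF-8 BOM (the three bytes EF BB BF):

```python
UTF8_BOM = b"\xEF\xBB\xBF"

def has_utf8_bom(path: str) -> bool:
    # A file "declares" itself as UTF-8 if it starts with the BOM bytes.
    with open(path, "rb") as f:
        return f.read(3) == UTF8_BOM
```

Python exposes the same idea as the "utf-8-sig" codec, which strips the BOM on read and writes one on save.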
UTF-8 is a horrible design. The only reason it was widely adopted was backwards compatibility with ASCII. There are a large number of invalid byte combinations that have to be discarded. Parsing forward is complex even before taking invalid byte combinations into account, and parsing backwards is even worse. Compare that to UTF-16, where parsing forward and backwards is simpler than UTF-8, and if there is an invalid surrogate combination, one can assume it is a valid UCS-2 char.
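To illustrate the claim about backwards parsing in UTF-16 (a Python sketch over a list of 16-bit code units; the helper name is made up): a trailing surrogate is recognizable from a single range check, so stepping back over a character never looks at more than two units.

```python
def char_start_utf16(units: list[int], i: int) -> int:
    # 0xDC00-0xDFFF is the low (trailing) surrogate range: the second half
    # of a surrogate pair. Anything else is a complete code point on its own.
    if 0xDC00 <= units[i] <= 0xDFFF and i > 0:
        return i - 1
    return i

# U+1D11E (musical G clef) encodes as the surrogate pair 0xD834 0xDD1E.
units = [ord("a"), 0xD834, 0xDD1E, ord("b")]
print(char_start_utf16(units, 2))  # 1: index 2 is the trailing half of the pair
```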
A little off topic, but amidst a lot of discussion of UTF-8 and its ASCII compatibility property I'm going to mention my one gripe with ASCII, something I never see anyone talking about, something I've never talked about before: the damn 0x7F character. Such an annoying anomaly in every conceivable way. It would be much better if it were some other proper printable punctuation or punctuation-adjacent character. A copyright character. Or a pi character, or just about anything other than what it already is. I have been programming and studying packet dumps long enough that I can basically convert hex to ASCII and vice versa in my head, but I still recoil at this anomalous character (DELETE? is that what I should call it?) every time.
It took time for UTF-8 to make sense. Struggling with how large everything became was a real problem just after the turn of the century. Today it makes more sense because capacity and compute power are much greater, but back then it was a huge pain in the ass.
I read online that codepoints are formatted with at least 4 hex digits for historical reasons, so U+41 (Latin A) is written as U+0041.
I have always wondered - what if the utf-8 space is filled up? Does it automatically promote to having a 5th byte? Is that part of the spec? Or are we then talking about utf-16?
Meanwhile Shift-JIS has a bad design, since the second byte of a character can be any byte in the range 0x40-0x7E (among others), which overlaps printable ASCII. This includes brackets, backslash, caret, backquote, curly braces, pipe, and tilde. This can cause a path separator or math operator to appear in text that is encoded as Shift-JIS but interpreted as plain ASCII.
UTF-8 basically learned from the mistakes of previous encodings which allowed that kind of thing.
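A concrete Python example of the hazard described above: '表' (U+8868) is the classic case whose second Shift-JIS byte is 0x5C, the ASCII backslash.

```python
encoded = "表".encode("shift_jis")
print(encoded)          # b'\x95\\' -> the second byte is 0x5C
print(0x5C in encoded)  # True: an encoding-unaware tool may treat it as '\'
```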
UTF-8 is undeniably a good answer, but to a relatively simple bit-twiddling / variable-length integer encoding problem in a somewhat specific context.
I realize that hindsight is 20/20, and times were different, but let's face it: "how to use an unused top bit to best encode the larger numbers representing Unicode" is not that much of a challenge, and the space of practical solutions isn't even all that large.
Uvarint also has the property that a file containing only ASCII characters is still a valid ASCII file.
Even for varints (you could probably drop the intermediate prefixes for that). There are many examples of using SIMD to decode UTF-8, whereas the more common protobuf scheme is known to be hostile to SIMD and the branch predictor.
Well, yes, Ken Thompson, the father of Unix, is behind it.
UTF-8 contributors are some of our modern day unsung heroes. The design is brilliant but the dedication to encode every single way humans communicate via text into a single standard, and succeed at it, is truly on another level.
Most other standards just do the xkcd thing: "now there's 15 competing standards"
> Every ASCII encoded file is a valid UTF-8 file.
More importantly, that file has the same meaning. Same with the converse: a UTF-8 file that happens to contain only ASCII bytes is a valid ASCII file with the same meaning.
What are the perceived benefits of UTF-16 and UTF-32, and why did they come about?
I could ask Gemini but HN seems more knowledgeable.
It really is, in so many ways.
It is amazing how successful it's been.
UTF-8 is simply genius. It entirely obviated the need for clunky 2-byte encodings (and all the associated nonsense about byte order marks).
The only problem with UTF-8 is that Windows and Java were developed without knowledge of UTF-8 and ended up with 16-bit characters.
Oh yes, and Python 3 should have known better when it went through the string-bytes split.
I'll mention IPv6 as a bad design that could potentially have been a UTF-8-like success story.
UTF-8 should be a universal tokenizer
Great example of a technology you get from a brilliant guy with a vision and that you'll never get out of a committee.
Another collaboration by Pike and Thompson can be seen here: https://go.dev/.
UTF-8 was a huge improvement for sure, but 20-25 years ago I was working with Latin-1 (so 8-bit characters), which was a struggle in the years it took for everything to switch to UTF-8. The compatibility with ASCII meant you only really noticed something was wrong when the data had special characters not representable in ASCII but valid in Latin-1. So perhaps breaking backwards compatibility would've resulted in less data corruption overall.
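A small Python illustration of the kind of silent mismatch described above: Latin-1 bytes for a non-ASCII character are invalid UTF-8, and UTF-8 bytes read as Latin-1 turn into the familiar mojibake.

```python
latin1 = "café".encode("latin-1")  # b'caf\xe9'
try:
    latin1.decode("utf-8")         # 0xE9 is not valid on its own in UTF-8
except UnicodeDecodeError:
    print("not valid UTF-8")

utf8 = "café".encode("utf-8")      # b'caf\xc3\xa9'
print(utf8.decode("latin-1"))      # 'cafÃ©' -- classic mojibake
```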
I'm just gonna leave this here too: https://www.youtube.com/watch?v=MijmeoH9LT4
Looks similar to MIDI.
> Another one is the ISO/IEC 8859 encodings are single-byte encodings that extend ASCII to include additional characters, but they are limited to 256 characters.
ISO 2022 allowed you to use control codes to switch between ISO 8859 character sets though, allowing for mixed-script text streams.
How many llm tokens are wasted everyday resolving utf issues?
meh. it's a brilliant design to put a bandage over a bad design. if a language can't fit into 255 glyphs, it should be reinvented.
Now fix fonts! It should be possible to render any valid string in a font.
Having the continuation bytes always start with the bits `10` also makes it possible to seek to any random byte and trivially know if you're at the beginning of a character or at a continuation byte, like you mentioned, so you can easily find the beginning of the next or previous character.
If the characters were instead encoded like EBML's variable size integers[1] (but inverting 1 and 0 to keep ASCII compatibility for the single-byte case), and you do a random seek, it wouldn't be as easy (or maybe not even possible) to know if you landed on the beginning of a character or in one of the `xxxx xxxx` bytes.
[1]: https://www.rfc-editor.org/rfc/rfc8794#section-4.4
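A tiny Python sketch of that resynchronization trick (the function name is just illustrative): from any byte offset, skip backwards over continuation bytes (10xxxxxx) to land on the start of the current character.

```python
def char_start(data: bytes, i: int) -> int:
    while i > 0 and (data[i] & 0xC0) == 0x80:  # 0b10xxxxxx means continuation byte
        i -= 1
    return i

s = "héllo".encode("utf-8")  # b'h\xc3\xa9llo'
print(char_start(s, 2))      # 1: byte 2 is the continuation byte of 'é'
```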