logoalt Hacker News

linguaeyesterday at 7:07 PM3 repliesview on HN

I remember learning Japanese in the early 2000s and the fun of dealing with multiple encodings for the same language: JIS, Shift-JIS, and EUC. As late as 2011 I had to deal with processing a dataset encoded under EUC in Python 2 for a graduate-level machine learning course where I worked on a project for segmenting Japanese sentences (typically there are no spaces in Japanese sentences).

UTF-8 made processing Japanese text much easier! No more needing to manually change encoding options in my browser! No more mojibake!


Replies

pezezintoday at 7:01 AM

I live in Japan and I still receive the random email or work document encoded in Shit-JIS. Mojibake is not as common as it once was, but still a problem.

show 1 reply
layer8yesterday at 9:14 PM

On the other hand, you now have to deal with the issues of Han unification: https://en.wikipedia.org/wiki/Han_unification#Examples_of_la...