logoalt Hacker News

SwiftyBugyesterday at 1:09 PM2 repliesview on HN

I wonder how well Mandarin works for LLM-based programming. On one hand, it's very token efficient as Mandarin script is very dense in meaning. On the other, I suppose this can increase ambiguity.


Replies

arjieyesterday at 8:54 PM

Character-density and token-efficiency are different things. Latter is data and, therefore, tokenizer specific e.g. take GPT-5's tokenizer o200k_base and run mandarin text and its translation through. Some amount of the time en will beat zh. I just tested with news articles and wikipedia.

After all `def func():` is only 3 tokens on o200k_base.

jamesdutcyesterday at 8:20 PM

I can speak, read, and write Taiwanese Mandarin (which is likely relatively underrepresented in the training sets and, which is, in my practical experience, materially different in its usage.)

The authoritative answer for this question would best come from the millions (or tens of millions) of Chinese-speakers who are currently using LLMs to write software.

However, it is my suspicion that you would see no advantages using any language other than English. While there is a certain token-level density to written texts, it seems the benefits of this (and the more recent discussion around “caveman talk”) are quite limited.

Furthermore, consider that the vast majority of textbooks, technical documentation, blog posts, StackOverflow answers, &c. are originally in English. Historically, where these have been translated to Chinese, the translations have often been of very poor quality (and the terminology and phraseology is often incomprehensible unless you also understand some English.) I would suspect that this makes up the overwhelming majority of the training sets for these models.

That said, my experience using the most recent models, is that they are surprisingly language-agnostic in a way that surpasses readily-available human capability. For example, I can prompt the LLM to translate English into something that uses German grammar, Chinese vocabulary, and Japanese characters, and I'll get an output that is worse than what a human expert could do… but where am I going to find a multilingual expert?

(Of course, I have so far only ever been impressed that a model could generate an output but never impressed with the output it did generate. Everything—translations, prose, code—seems universally sloppy and bland and muddy.)

So what I would anticipate the biggest benefit for a Chinese-speaker today… is that if they are disinterested in working internationally, they have significantly less dependency on learning English.