Training data can't be the whole answer. LLMs are really good at translating between programming languages, which makes sense given that they are derived from text translation systems. I'm getting great results even in languages with comparatively small bodies of freely available code. The bigger hurdle is that LLMs tend to copy the common idioms of the target language, and if it is an "enterprise-y" language like Java or C#, the amount of useless boilerplate can skyrocket. That creates a real danger that the result grows beyond the usable context window and the quality suffers.
Very true.
I have to steer models hard for C++. They constantly suggest std::variant :P
In a higher-dimensional vector space, yes, it can.
Dimensionality gets bizarre in 1000-D space. Similarity and orthogonality express themselves in strange ways, and each dimension encodes a different semantic meaning.
Therefore, if the training data is highly consistent, you are by definition reducing some complexity and/or encoding similarity more faithfully.
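One concrete way to see the strangeness: as dimensionality grows, independently sampled random vectors become nearly orthogonal, so any measured similarity stands out sharply against that background. A minimal Go sketch of this (the Gaussian sampling and the dimension/pair counts are illustrative choices, not anything from this thread):

    package main

    import (
        "fmt"
        "math"
        "math/rand"
    )

    // cosine returns the cosine similarity of two equal-length vectors.
    func cosine(a, b []float64) float64 {
        var dot, na, nb float64
        for i := range a {
            dot += a[i] * b[i]
            na += a[i] * a[i]
            nb += b[i] * b[i]
        }
        return dot / (math.Sqrt(na) * math.Sqrt(nb))
    }

    // randVec samples a vector with independent Gaussian components.
    func randVec(dim int, rng *rand.Rand) []float64 {
        v := make([]float64, dim)
        for i := range v {
            v[i] = rng.NormFloat64()
        }
        return v
    }

    func main() {
        rng := rand.New(rand.NewSource(1))
        for _, dim := range []int{2, 10, 1000} {
            // Average |cosine| over many random pairs: it shrinks as the
            // dimension grows, i.e. random vectors become nearly orthogonal.
            const pairs = 1000
            var sum float64
            for i := 0; i < pairs; i++ {
                sum += math.Abs(cosine(randVec(dim, rng), randVec(dim, rng)))
            }
            fmt.Printf("dim=%4d  mean |cos| = %.3f\n", dim, sum/pairs)
        }
    }

Running this, the mean |cos| falls from roughly 0.6 at dim=2 to around 0.03 at dim=1000.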
In Go, the statement
result, err := storage.Write(...)
is almost always going to be followed by if err != nil { ... }
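For concreteness, a minimal self-contained sketch of that regularity; Storage and Write here are hypothetical names standing in for the call above:

    package main

    import (
        "errors"
        "fmt"
    )

    // Storage is a hypothetical store, used only to illustrate the idiom.
    type Storage struct{}

    // Write pretends to persist data and reports how many bytes it wrote.
    func (s *Storage) Write(data []byte) (int, error) {
        if len(data) == 0 {
            return 0, errors.New("empty write")
        }
        return len(data), nil
    }

    func main() {
        storage := &Storage{}
        result, err := storage.Write([]byte("hello"))
        // In idiomatic Go this check follows almost every fallible call,
        // which is exactly the kind of regularity a model can lean on.
        if err != nil {
            fmt.Println("write failed:", err)
            return
        }
        fmt.Println("wrote", result, "bytes")
    }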
In a highly dynamic language you may not get
try { storage.write() } catch (error) { ... }
unless explicitly asked for.

> LLMs are really good at translating to different programming languages.
...for which ample training data is available.
> This makes sense, given that they are derived from text translation systems.
...for languages with ample training data available.
Yes, LLMs can combine information in novel ways. They are wonderful in many respects. But they make far more mistakes if they can't lean on copious amounts of training data. Invent a toy language, write a spec, and ask them to use it. They will, but they will have a hard time.
> Training data can't be the whole answer.
Absolutely correct. Anthropic showed that as few as 250 poisoned training documents can backdoor an LLM, and that this holds regardless of model size (parameter count).