
somat today at 6:44 AM

"When the software is being written by agents as much as by humans, the familiar-language argument is the weakest it has ever been - an LLM does not care whether your codebase is Java or Clojure. It cares about the token efficiency of the code, the structural regularity of the data, the stability of the language's semantics across releases."

Isn't familiarity with the language even more important with an LLM? The language they do best with is the one with the largest corpus in the training set.


Replies

dgb23 today at 7:24 AM

Familiarity matters to some degree. But there are diminishing returns I think.

Stability, consistency and simplicity are much more important than this notion of familiarity (there being lots of code to train on), as long as the corpus is sufficiently large. Another important factor is how clear and accessible the libraries, especially the standard libraries, are.

Take Zig for example. It's a very explicit and clear language, with easy access to the std lib. For a young language it is consistent in its style. An agent can write reasonable Zig code and debug issues from tests. However, it is still unstable and its APIs change, so LLMs regularly get confused.

Languages and ecosystems that are more mature and take stability very seriously, like Go or Clojure, don't have the problem of "LLM hallucinates APIs" nearly as much.

The thing with Clojure is also that it's a very expressive and very dynamic language. You can hook an agent up to the REPL and it can very quickly validate or explore things. With most other languages it needs to change a file (which takes multiple, more complex operations), then write an explicit test, then run that test to get the same result as "defn this function and run some invocations".
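As a rough illustration of that "define it and run some invocations" loop, here is a minimal sketch in Python rather than Clojure. It assumes a persistent in-process namespace standing in for a live REPL session; `session`, `repl_eval` and `slugify` are hypothetical names made up for this example, not part of any agent tooling.

```python
# Hypothetical stand-in for a live REPL session: one shared namespace
# that survives between snippets, so definitions stay available.
session = {}


def repl_eval(source):
    """Evaluate one snippet in the shared session, REPL style."""
    try:
        # Expressions return a value, like typing them at a prompt.
        return eval(compile(source, "<repl>", "eval"), session)
    except SyntaxError:
        # Statements (def, assignment) are run for their side effects.
        exec(compile(source, "<repl>", "exec"), session)
        return None


# Step 1: the agent defines a function directly in the session.
repl_eval("def slugify(title):\n    return '-'.join(title.lower().split())")

# Step 2: it immediately runs an invocation to check the behaviour.
result = repl_eval("slugify('Hello Hacker News')")
print(result)
```

There is no file edit, test file or test runner involved here. The feedback comes back from the same session that holds the definition, which is roughly the shortcut the REPL workflow gives you.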

ehnto today at 6:53 AM

And they're very sensitive to new releases, which often makes them difficult to work with after a major release of a framework, for example. They trip up on minor stuff like new functions, changes in signatures, etc.

A stable, mature framework, then, is the best-case scenario. New or rapidly changing frameworks will be difficult, wasting lots of tokens on discovery and corrections.

bilekas today at 8:28 AM

Yes, I'd agree that, from the perspective of the model, one cohesive, well-established language would be more reliable. The nightmare scenario is an enterprise suite with a hodgepodge of every language known to man, all mangled together because the frontier model at the time decided Haskell would be the most efficient when compiled for WebAssembly, and some poor intern has to fix a bug that should cost 100x less to fix than rerunning the LLM to patch it.

lelanthran today at 1:09 PM

> The language they do best with is the one with the largest corpus in the training set.

Up to a point, I guess? There must be a point of diminishing returns based on the expressiveness of the language.

I mean, a language that has 8 different ways to declare + initialise composite variables needs to have a much larger training corpus than a language that has only 2 or 3 different ways.

The more expressive a language, the more distinct suitable patterns the model has to see, which results in a larger corpus being needed.
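To make the "many ways to initialise the same thing" point concrete, here is a small Python sketch. Python is only an assumption for illustration, since the comment doesn't name a language. It builds one composite value five different ways, and a model would need to have seen enough of each form to read and write all of them reliably.

```python
# Five spellings of the same composite value in one language.
p_literal = {"x": 1, "y": 2}                          # literal syntax
p_kwargs = dict(x=1, y=2)                             # constructor, keywords
p_pairs = dict([("x", 1), ("y", 2)])                  # constructor, pairs
p_zip = dict(zip(("x", "y"), (1, 2)))                 # zipped keys and values
p_comp = {k: v for k, v in (("x", 1), ("y", 2))}      # comprehension

variants = [p_literal, p_kwargs, p_pairs, p_zip, p_comp]

# Every form produces an equal dictionary; only the surface syntax differs.
all_equal = all(v == p_literal for v in variants)
print(len(variants), all_equal)
```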