Can someone explain why LLMs write like this when most humans don't?
Most people write badly. Much of the text on the public Internet is written by professional writers, who tend to write less badly.
When people use LLMs to generate text, they often ask the model to write like a professional. (I haven't tried, but I assume that if you ask an LLM to write like a Reddit troll it will use a different set of forms.)
When you ask an LLM to write like a professional writer, it will aim to sound like a professional writer. And professional writers do in fact use words like "delve" and "robust", in writing and in speech, because they spend years cultivating their vocabularies.
Professional writers are comfortable with punctuation marks and know the difference between the em dash and the en dash, and when to use each versus other marks. (The typical non-professional cannot manage to use the apostrophe, much less the marks that require judgement.)
And a lot of them end up writing business content at some point in their careers. Which leads to an interesting mashup where you may get "leverage" used as a verb alongside some of the other telltale patterns.
Because business writing is its own universe. LinkedIn has been swimming in content that would be flagged as LLM-generated for at least 10 years, long before ChatGPT landed.
I asked ChatGPT about that and it gave a nicely reasoned explanation of how AI output compares to human writing.
That said, I think the problem is that people treat LLM output as final.
It should be treated more as idea generation or an early draft: something to get you past the “staring at a blank page” stage, get the creative juices flowing, and help you create your own content.
If purely AI-generated content ends up feeding the algorithms, soon enough everything sounds the same (it already does in a lot of places).
Writing like this (say, a technical blog post) is supposed to communicate ideas effectively. Rhetoric, vocabulary, and metaphor all aid this communication in good writing.
But the prompt is usually bereft of fully fleshed-out ideas, so the LLM substitutes style in a futile attempt to amplify the signal.
Though maybe it’s not futile! HN voters eat this stuff up daily.
Most humans don’t, but maybe “most humans” do? As in, on average, as a collective, regressed to the mean of mediocrity and devoid of personality, we write like this? It’s not self-deprecating, it’s humbling.
Base models don't write like that. The style appears during RLHF. It's not totally clear why*, but probably a large part of the answer is that this style looks great to human reviewers, and only starts looking terrible once you get to play around with the released model and realise it talks like that all the time.
* The technical term is "mode collapse", see [1][2]
[1] https://en.wikipedia.org/wiki/Mode_collapse
[2] https://gwern.net/doc/reinforcement-learning/preference-lear...
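To make that concrete, here's a toy sketch (my own, not from the linked references) of how repeated preference selection can collapse a diverse distribution onto whichever mode reviewers rate highest. The style names and "appeal" scores are invented for illustration:

```python
import random

# Toy sketch of mode collapse under preference selection.
# "appeal" stands in for how good each style looks to a human reviewer
# in a single comparison; all values here are made up.
styles = ["plain", "casual", "academic", "polished-professional"]
appeal = {"plain": 0.5, "casual": 0.4,
          "academic": 0.6, "polished-professional": 0.9}

# The "base model" starts out spread evenly across all styles.
weights = {s: 1.0 for s in styles}

for _ in range(200):  # each iteration mimics one preference comparison
    a, b = random.choices(styles, weights=[weights[s] for s in styles], k=2)
    # The reviewer picks whichever sample looks better; the winner is upweighted.
    winner = a if random.random() < appeal[a] / (appeal[a] + appeal[b]) else b
    weights[winner] *= 1.05

total = sum(weights.values())
for s in styles:
    print(f"{s:24s} {weights[s] / total:.1%}")
# Most of the probability mass typically ends up on "polished-professional":
# the tuned model talks like that nearly all the time.
```

It's a cartoon, of course: real RLHF optimises a policy against a learned reward model, not a four-entry table. But the feedback loop is the same shape: whatever reviewers slightly prefer gets reinforced until it crowds everything else out.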
I suppose it might be because humans who use LLMs write like this.
Generally, the more you write (and especially, the more long-form content you write), the better your writing becomes. This also goes in reverse: those who have great trouble writing are unlikely to do much of it.
This alone can account for the seeming disparity. Though many people write poorly, they do not write much text for public consumption at all.