Surprised that no comment mentioned that there is a standard term (not a word :P) for the set of words that denominates a particular concept: nominal syntagm. Such as "boiling water" and also "that green parrot we saw yesterday over the left branch".
Also the slider examples are abysmal. "I love you", "Go home" and "How are you" are not words by any stretch of imagination. For someone who makes word games, I don't see a particularly deep love of words here.
Edit: Obligatory reference to Borges's Tlön: https://en.wikipedia.org/wiki/Tl%C3%B6n,_Uqbar,_Orbis_Tertiu...
In addition to what others have pointed out, many of these aren't actually missing from traditional dictionaries: they're just inflected differently. So your example lists phrases like "operating systems", "immune systems" and "solar systems" as missing from traditional dictionaries, but at least the online OED and M-W have "operating system", "immune system" and "solar system" in them. It's just that your script is apparently listing the plural as a separate phrase.
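To make the plural point concrete, here is a minimal sketch of the normalization step that seems to be missing. Everything in it is hypothetical: the function names, the crude suffix rules and the three-entry toy dictionary are mine, not the article's script. A real pipeline would use a proper lemmatizer.

```python
def singularize_last_word(phrase: str) -> str:
    """Crudely strip an English plural ending from the final word only."""
    head, _, last = phrase.rpartition(" ")
    if last.endswith("ies") and len(last) > 4:
        last = last[:-3] + "y"          # "batteries" -> "battery"
    elif last.endswith(("ches", "shes", "xes", "sses")):
        last = last[:-2]                # "boxes" -> "box"
    elif last.endswith("s") and not last.endswith("ss"):
        last = last[:-1]                # "systems" -> "system"
    return f"{head} {last}" if head else last


def missing_phrases(candidates, dictionary):
    """Keep only phrases absent in both their given and singular form."""
    return [
        p for p in candidates
        if p not in dictionary and singularize_last_word(p) not in dictionary
    ]


# Toy stand-in for a real dictionary's headword list (assumption).
toy_dictionary = {"operating system", "immune system", "solar system"}
candidates = ["operating systems", "immune systems", "solar systems", "boiling water"]
still_missing = missing_phrases(candidates, toy_dictionary)
```

With a check like this, the three "systems" phrases would no longer be reported as missing, and only "boiling water" would remain.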
On languages other than English: in general, different languages do word division very differently. At least in German and Dutch, many of those phrasal verbs are separable, meaning that they are one word in the infinitive but are multiple words in the present tense. So for example, where in English you would say "I log in to the website", in Dutch it would be "Ik log in op de website". "Log in" is two words in both cases, but in Dutch it's the separated form of the single-word separable verb inloggen ("I must log in now" = "Ik moet nu inloggen"). The verb is indeed separable in that the two words often don't end up next to each other: "I log in quickly" = "Ik log snel in".
Dutch, like German, has lots of compounds. But there are also agglutinative languages, which have even more complex compound words, perhaps comprising a whole sentence in another language. Eg (from Wikipedia) Turkish "evlerinizdenmiş" = "(he/she/it) was (apparently/said to be) from your houses" or Plains Cree "paehtāwāēwesew" = "he is heard by higher powers"; and these aren't corner cases, that's how the language works.
A compound word isn't just a phrase. The latter is a group of words that indicate a single concept. The former is a new word that has a distinct meaning from the subwords that compose it. "I love you" is an example of a clausal phrase. The meaning is entirely evident from the words that compose it. In contrast, a "hot dog" is not a particularly warm canine, and has its own OED entry [0] as a compound word.
And some of the entries on this list are wrong. "Good night" exists in OED as "goodnight" [1] because there are multiple ways it's used. One is the clausal phrase "I hope you have a good night", which can be modified by changing the adjective, e.g. "great night" or "terrible night". "Goodnight" the bedtime ritual can't be modified the same way, so OED chooses to write it as a compound word without spaces.
> But roughly 15% are plausible: “wooden chair,” “morning coffee.” That’s still 30 billion sensible pairs.
(1) Who counted those? Whence those numbers?
(2) The examples are normal two-word phrases with one word modifying the other, often categorised as an adjective. The examples are counter-examples to the very claim made in that article.
(3) Using Clause to brainstorm s.t. is a weird thing to say...
(4) I would say the use of 'lexicalized' is wrong or at least uncommon. It usually refers to specialised semantics of something that could be interpreted generically, too. Like 'sleeping bag'. Or indeed 'cold feet'. Lexicalisation may involve deleting spaces, like 'hotdog'. And I am pretty sure lexicalised phrasal words are usually intentionally listed in dictionaries. And so 'ice' is not lexicalised 'frozen water', but it is not overtly a phrase but is a separate atomic word.
=> I don't get the point.
The author of this article just hasn’t been taught how to use a dictionary. The words aren’t “missing”, they’re just indexed under one of their parts. For example “wait upon” would be located within the entry for “wait”.
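A rough sketch of what "indexed under one of their parts" means for a lookup, assuming a hypothetical toy structure where each headword carries a dict of sub-entries. The data layout and names are made up for illustration and are not any real dictionary's format.

```python
# Hypothetical toy dictionary: headword -> definition plus nested sub-entries.
TOY_DICTIONARY = {
    "wait": {
        "definition": "(main entry text)",
        "subentries": {"wait upon": "(sub-entry text)"},
    },
    "water": {"definition": "(main entry text)", "subentries": {}},
}


def find_phrase(phrase, dictionary):
    """Return the headword under which the phrase is listed, or None."""
    if phrase in dictionary:
        return phrase  # the phrase has its own top-level entry
    for word in phrase.split():
        entry = dictionary.get(word)
        if entry and phrase in entry["subentries"]:
            return word  # found nested under one of its parts
    return None


headword = find_phrase("wait upon", TOY_DICTIONARY)
```

A script that only compares phrases against top-level headwords would call "wait upon" missing; checking the sub-entries of each component word finds it under "wait".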
> There are nearly half a million compound phrases that aren’t in any dictionary—simply because they contain spaces. “Boiling water.” “Saturday night.” “Help me.”
I would hope that none of those examples were taking up space in a dictionary.
Off the top of my head, peanut butter, black hole, and amusement park are concepts that can't be easily intuited by just combining the two singular terms, but I also wouldn't consider them as phrases.
One of the axes this analysis seems to be missing is the subtle spectrum from "multi-word expressions" to "idioms". Traditional lexicographers have long published separate idioms books, such as the Merriam-Webster New World American Idioms Handbook and the Oxford Dictionary of Idioms.
Wiktionary doesn't need to make that distinction between MWEs and Idioms and tends to conflate MWEs and Idioms as there is no separate "Wikidiom". Arguably, that multi-book confusion runs deep on the internet because Urban Dictionary should probably be fully titled the Urban Dictionary of Idioms and Slang.
It's not just page limits but also categorical limits and classic lexicographers would build multiple books/volumes, not just settle on one "dictionary". Classic scholars would often have a "reference shelf" with multiple dictionaries, books of idioms, thesauri, and more. The CD-ROM and then the internet has kind of tunnel visioned that this entire shelf can be merely "one app".
I'm currently reading Cormac McCarthy's Suttree (my first of his novels): just an exceptional polymath capable of painting complicated scenery with words dozenly scattered throughout paragraphs [0].
My favorite adjective he's coördinated is "burntwing", used to describe moths spiraling downwards after passing through candleflames. If I had crafted such a descriptive contraction, my former styling would've been "burnt-wing", had I even been capable of generating such concise imagery [1].
McCarthy's stylings have helped me to reduce hyphenations in my own writings — reducing their usage mainly to contractedwords which might be all-too-confusing without them.
[0] pg104 has ten words whose definitions I do not know, yet through context they work to advance the storyline of character racists (book is set in 1950s).
[1] decades ago, during college burnout, I was searching for the essence of "burntwing", reduced to writing a professor about "feeling like a burning airplane in tailspin." My trajectory back then was definitely burntwing.
This feels like ragebait (rage bait?) for people that enjoy language and words. The leading example is nonsense.
Is nobody going to mention that "taco [N WORD]" is one of the words there? (Third page from the end)
It appears to me that the author is trying too hard to make a point: "merry-go-round" is a single compound word that several dictionaries contain; "canned goods" is not commonly used[1] (more of a bureaucratic jargon), and people would just say "cans"(US) or "preserves" (UK); "household chores" is simply "chores", as the word is no longer commonly used outside the house context; "coffee break ritual" is not a concept in English-speaking countries so it would make no sense to have it in a dictionary, and so many of the examples are exactly that.
[1] I wonder how many here have ever been told something like "Prithee, husband, bring back a dozen canned goods from the market, for in the meanwhile I shall do my household chores".
While 'this analysis would not have been possible without LLM', I am not sure the LLM analysis was well reviewed after it was done. From the obscure/familiar word list, some of the n-grams, e.g. "is resource", "seq size", "db xref", surely occur in the wild (as we well know), but I doubt we can argue they are missing from the dictionary. Knowing the realm, I would argue none of them are words, not even collocations. If "is resource" is, why not "has resource"? So while the path is surely interesting, this analysis lacks the scrutiny you would expect from a high-level LLM analysis.
If the first example was "monkey wrench" instead of "boiling water", we'd never have seen the article.
The name for these is "collocations".
Collocation dictionaries are lists of collocations. They're absent from single-word dictionaries because there are about 25x more collocations than single words.
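For anyone curious how candidate collocations get scored, one common corpus-linguistics measure is pointwise mutual information (PMI): log2 of how much more often two words appear together than chance would predict. Below is a minimal sketch over a made-up token list. It is not the article's method, and the function name and sample sentence are my own.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Score each adjacent word pair by PMI = log2(P(a,b) / (P(a) * P(b)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1  # number of adjacent pairs
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Tiny illustrative corpus (assumption); real work needs millions of tokens.
tokens = "the hot dog ate the hot dog near the cold water".split()
scores = bigram_pmi(tokens)
```

On this toy input "hot dog" scores higher than "the hot", since "the" also appears in other contexts. PMI is known to over-reward pairs seen only once, so real collocation work usually adds a minimum frequency cutoff.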
No. This article shows a distinct lack of understanding of the basic building blocks of the English language.
"Words" don't have "spaces."
Phrases are made of words separated by spaces.
"Boiling Water" is not a word.
"Water" is a word. A noun, the subject.
"Boiling" is a word. An adjective, in this case. Which modifies the subject.
I don't know if you're trying to be clever, but you're not.
Dictionaries containing spaced compounds were not scalable with print media. The printed OED was encyclopedic in scale. Compound dictionaries are more than feasible now. Arguing whether a collection of commonly used words are expressions or concepts or even single "spaced words" is beside the point. Simply identify these differences and classify them in the compendium.
As far as my limited knowledge of linguistics goes, the technical term is actually "collocations."
To me, any discussion of this topic that doesn't mention collocations signals an amateurish approach.
I also disagree with the premise that "this was not possible before LLM." That's nonsense. Linguists created many dictionaries of collocations for different languages, so that work is precisely what they did!
(Before any LLM zealots attack me, yes, it is now possible to have a more exhaustive list of collocations thanks to LLMs. This doesn't contradict my point.)
Examples of collocation dictionaries:
Two related compound words from a Norwegian dialect, both mean "fish food":
Fiskemat Fiskmat
The latter means food made from fish, the former means food for fish. Standard varieties of Norwegian only use the former to mean both, to the annoyance of many old fishermen.
This maybe illustrates why the author's examples such as boiling water aren't so weird. Yes, in English it means water that's boiling, but you have to know that. It could for instance have meant water for boiling, like "cooling water" means water for cooling say in a nuclear reactor, not water which is in the process of getting cool.
I disagree these belong in a traditional dictionary.
I could, however be convinced these could be documented/defined in a separate document, especially from the perspective you are coming from (word games).
This boils down to an "is Pluto a planet" debate.
We act as if some languages have "compound words" that can encompass entire sentences (subject & object attaching to the verb as prefixes or suffixes) while others don't form compounds, and most are somewhere in between. But these are all statements about lexicographic conventions and say nothing about the languages. In reality all languages are muddles sprawling across a multidimensional continuum, and they abso-frigging-lutely don't sit neatly in such pigeonholes.
In German, they just remove the spaces and keep the word, and this problem is solved:
Entschädigungsleistungen - compensation benefits
Wiederbeschaffungskosten - replacement value
Kraftfahrzeughaftpflichtversicherung - motor vehicle liability insurance
Donaudampfschifffahrtsgesellschaftskapitän - Danube steamboat captain
Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz - beef labeling regulation law
I'd point folks to the concept of "Construction Grammar", which is related to this problem: https://en.wikipedia.org/wiki/Construction_grammar
I don't think 'Words with spaces' is a thing.
I think maybe the word the author is looking for is 'phrase'
Examples of "obscure" compound words include "list uids", "beg pos", "sync binlog", "gfp mask", "av fetch", "str idx", "seq ptr", "ai family", "fmt vuln", "ai socktype", "curr tok", "nbits set", "ini get", "s1 s2", "in addr", "num get", "res init", "sess ref", and "ai addrlen".
Well I can't even.
There are an infinite number of describable concepts that don't get a specific word. That doesn't mean the whole description is a "word with spaces."
It's just part of how language works that when there isn't a single word carrying the meaning you want, you put multiple words together and they can mean the thing together.
Even though there isn't a specific word for that, I wouldn't say "It's just part of how language works that when there isn't a single word carrying the meaning you want, you put multiple words together and they can mean the thing together" actually is one big word with spaces in it.
It's a bunch of words together that carry a more specific meaning when put together in that order.
Isn't this the difference between a dictionary and an encyclopedia?
I imagine that languages like german that create composites of nouns have less of a problem with this:
English: cream of mushroom soup
Spanish: sopa cremosa de champiñones
German: Champignoncremesuppe
“Hospital bills”. That’s very country specific. Also, that’s two words.
If the compound words all have single word entries in the dictionary that when combined mean the same thing what is the point?
Water: transparent, odorless, tasteless liquid
Boiling: having reached the boiling point
Boiling Water: transparent, odorless, tasteless liquid which has reached the boiling point
If Boiling Water had some other completely different meaning that has nothing to do with the individual words then sure, maybe, otherwise this is completely redundant and opinionated.
Hah, I wonder how thick a German, Dutch or Afrikaans dictionary would be if it included all possible spaceless compound words. Literally any concept can be compounded together to make a new word.
Rooivleisslaghuisinspekteur =
Rooi = red
Vleis = meat
Slag = butcher
Huis = house
Inspekteur = inspector
"Inspector who controls the quality of red meat in butcheries"
Sometimes singular semantic concepts can take multiple syntactic words to express. Why not call this idea something other than “word”?
Dictionaries are a mixed bag at best. If you apply David Kaplan’s character/content distinction from Demonstratives, you have to ask: should pure indexicals, which are essentially 'contentless' pointers be treated the same way as standard words? Let alone the thousands of rigid designators in this dataset that map directly to specific objects in the real world. At a certain point, is there no room left for encyclopedias?
I got into solving the NYT crossword during Covid. I couldn’t solve a Monday when I started; now I do Mondays downs-only and look forward to Saturdays. Along the way, I developed a sixth sense for when an answer will be more than one word. I’ve thought a lot about it and can’t really describe how I do it. (Some other puzzles clarify if an answer spans multiple words, but I find the ambiguity adds to the fun.)
> Got a word: frozen water → ice
> Didn’t: boiling water
Freezing water doesn’t have a word. Boiled water does have a word. These are under-respected for non-native English speakers.
"to be" is a very weird example because that's just the full infinitive of "be" which is definitely in dictionaries: https://www.merriam-webster.com/dictionary/be
these are called phrases
>Spanish carves up time with precision English lacks: madrugada for the pre-dawn hours, atardecer for late afternoon waning into evening. The mid-day nap was so compelling we adopted the siesta into English.
"I used to smoke marijuana. But I’ll tell you something: I would only smoke it in the late evening. Oh, occasionally the early evening, but usually the late evening -- or the mid evening. Just the early evening, mid evening and late evening. Occasionally, early afternoon, early midafternoon, or perhaps the late-midafternoon. Oh, sometimes the early-mid-late-early morning... But never at dusk." -Steve Martin
With Twain in mind, might I suggest we adopt the simple expedient of snake casing such terms.
"book steaks" is in the list, but I don't think it' real. Maybe it was supposed to be "stack".
Fascinating! I’d add “word nerd” to the list to describe the authors.
Clearly those Irish monks are to blame.
Very cool project! Reminds me of Chiang's great short story 'The Truth of Fact, the Truth of Feeling':
> “If you speak slowly, you pause very briefly after each word. Thatʼs why we leave a space in those places when we write. Like this: How. Many. Years. Old. Are. You?” He wrote on his paper as he spoke, leaving a space every time he paused: Anyom a ou kuma a me?
> “But you speak slowly because youʼre a foreigner. Iʼm Tiv, so I donʼt pause when I speak. Shouldnʼt my writing be the same?”
On another note, I always wished "never mind" was spelled "nevermind"
"Opaque MWE"? Does no one know the word "idiom"?
> “Boiling water” isn’t “water that happens to be boiling.” It’s a hazard, a cooking stage, a state of matter
I guess we'll have to disagree then, because "boiling water" is "water that's boiling" to me. It's not a different state of matter to "water", that would be "steam". It being a hazard doesn't mean it's a singular concept, same as "wet floor"