It's too bad we didn't go down the XHTML/semantic web route twenty years ago.
Strict documents, reusable types, microformats, etc. would have put search into the hands of the masses rather than kept it in Google's unique domain.
The web would have been more composable and P2P. We'd have been able to slurp first-class article content, comments, contact details, factual information, addresses, etc., and build a wealth of tooling.
Google / WHATWG wanted easy-to-author pages (~= "sloppy markup, nonstandard docs") because nobody else could "organize the web" like them if it was disorganized by default.
By the late 2010s, Google's need for the web had started to wane. They began embedding lifted facts directly into search results, tried to push AMP to keep us from going to websites, etc.
Google's decisions and technologies have been designed to keep us in their funnel. Web tech has been nudged and mutated to accomplish that. It's especially easy to see when the tides change.
The thing I liked the most about XHTML was how it enforced strict notation.
Elements had to be used in their pure form, and CSS was for all visual presentation.
It really helped me understand and be better at web development - getting the tick from the XHTML validator was always an achievement for complicated webpages.
I don't think there was ever a sustainable route to a semantic web that would work for the masses.
People wanted to write and publish. Only a small portion of people/institutions would have had the resources or appetite to tag factual information on their pages. Most people would have ignored the semantic taxonomies (or just wouldn't have published at all). I guess a small and insular semantic web is better than no semantic web, but I doubt there was a scenario where the web would have been as rich as it actually became, but was also rigidly organized.
I kinda bailed on being optimistic/enthusiastic about the Web when xhtml wasn't adopted as the way forward.
It was such a huge improvement. For some reason, rather than just tolerating the old tag-soup mess while forging the way for a brighter future, we went "nah, let's embrace the mess". WTF.
It was so cool to be able to apply XML tools to the Web and have it actually work. Like getting a big present for Christmas. That was promptly thrown in a dumpster.
The "semantic" part was what eventually became W3C's RDF stuff (a pet project of TBL's predating even the Web). When people squeeze poetry, threaded discussion, and other emergent text forms into a vocabulary for casual academic publishing and call that "semantic HTML", that still doesn't make it semantic.
The "strict markup" part can be (and always could be) had using SGML, which is a superset of XML that also supports HTML's empty elements, tag inference, attribute shortforms, etc. HTML was invented as an SGML vocabulary in the first place.
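For illustration, the SGML minimization features mentioned above look roughly like this in classic HTML (fine under SGML rules, invalid as XML; the markup is made up):

```html
<!-- SGML-style HTML -->
<ul>
  <li>First item   <!-- </li> end tag inferred -->
  <li>Second item
</ul>
<input type=checkbox checked>  <!-- unquoted value, attribute shortform -->
<br>                           <!-- empty element with no closing slash -->
```

In XML/XHTML the same thing has to be spelled out in full: `<li>…</li>`, `checked="checked"`, and `<br />`.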
Agree, though, that Google derailed any meaningful standardization effort for the reasons you stated. Actually, it started already with CSS and the idiocy of piling yet another property-value syntax on top of SGML/HTML when it already had attributes for formatting. The "semantic HTML" postulate is kind of an after-the-fact justification for the insane CSS complexity that could grow precisely because it wasn't part of HTML proper and so escaped the scrutiny that comes with introducing new elements or attributes.
I kinda agree with you, but I'd argue the "death" of microformats is unrelated to the death of XHTML (though schema.org is still around).
You could still use e.g. hReview today, but nobody does. In the end the problem of microformats was that "I want my content to be used outside my web property" is something nobody wants, beyond search engines that are supposed to drive traffic to you.
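For concreteness, an hReview-marked review would look roughly like this (class names per the microformats hReview draft; the content is invented):

```html
<div class="hreview">
  <span class="item"><span class="fn">Café Example</span></span>
  Rating: <span class="rating">4</span> out of 5.
  <span class="summary">Great espresso, slow Wi-Fi.</span>
  Reviewed by <span class="reviewer vcard"><span class="fn">Jane Doe</span></span>
</div>
```

All the data lives in ordinary HTML attributes, so any crawler could extract reviews without a special API - which is exactly the capability nobody ended up wanting to hand out.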
The fediverse is the only chance of reviving that concept because it basically keeps attribution around.
The semantic web is a silly dream of the 90s and 00s. It's not a realizable technology, and Google basically showed exactly why: as soon as you have a fixed algorithm for finding pages on the web, people will start gaming that algorithm to prioritize their content over others'. And I'm not talking about malicious actors trying to publish malware, but about every single publisher that has the money to invest in figuring out how, and doing it.
So any kind of purely algorithmic, metadata based retrieval algorithm would very quickly return almost pure garbage. What makes actual search engines work is the constant human work to change the algorithm in response to the people who are gaming it. Which goes against the idea of the semantic web somewhat, and completely against the idea of a local-first web search engine for the masses.
Me personally, I didn't even care that much about strict semantic web, but XML has the benefits of the entire ecosystem around it (like XPath and XSLT), composable extensibility in form of namespaces etc. It was very frustrating to see all that thrown out with HTML5, and the reasoning never made any sense to me (backwards compatibility with pre-XHTML pages would be best handled by defining a spec according to which they should be converted to XHTML).
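To the point about XPath and namespaces: with well-formed XHTML, generic XML tooling just works. A small sketch using Python's standard library (the document and element names are illustrative; ElementTree supports a subset of XPath):

```python
import xml.etree.ElementTree as ET

# A well-formed XHTML fragment: the namespace lets a generic XML
# tool address HTML elements unambiguously.
doc = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a href="/one">One</a>
    <p><a href="/two">Two</a></p>
  </body>
</html>"""

root = ET.fromstring(doc)
ns = {"x": "http://www.w3.org/1999/xhtml"}

# XPath-style query: every <a> element, anywhere in the tree.
hrefs = [a.get("href") for a in root.findall(".//x:a", ns)]
print(hrefs)  # ['/one', '/two']
```

None of this needs an HTML-specific parser; any XML library in any language could do the same, which was the appeal.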
That's not how the history went at all. When I worked at an internet co in the late 1990s (i.e. pre-Google's dominance), SGML was a minority interest. We used to try to sell clients on an SGML-based intranet because of the flexibility etc., and there was little interest; sloppy markup and incorrect HTML were very much the norm on the web back then (pre-Chrome etc.).
XHTML was just a stricter syntax for HTML. It didn't make it any more semantic.
I'm as big a critic of Google as anyone, but I'm always surprised at modern day takes around the lost semantic web technologies - they are missing facts or jumping to conclusions in hindsight.
Here's what people should know.
1) The failure of XHTML was very much a multi-vendor, industry-wide affair; the problem was that the syntax of XML was stricter than the syntax of HTML, and the web was already littered with broken HTML that the browser vendors all had to implement layers of quirk handling to parse. There was simply no clear user payoff for moving to the stricter parsing rules of XML, and there was basically no vendor who wanted to do the work. To my memory, Google does not really stand out here; like all the other vendors, they largely avoided working on what was frequently referred to as a science project.
2) In subsequent years, Google actually has delivered a semantic web of sorts: https://developers.google.com/search/docs/appearance/structu...
A few things stand out as interesting. First of all, the old semantic web never had a business case. JSON-LD structured data does: Google will parse your structured data and use it to inform the various snippets, factoids, previews, and interactive widgets they show all over their search engine and other web properties. As a result, JSON-LD has taken off massively. Millions of websites have adopted it. The data is there in the document; it is just in a JSON-LD section. If you work in SEO you know all about this. It seems to be quite rare that anyone on Hacker News is aware of it, however.
Second interesting thing: why did we end up with the semantic data being in JSON in a separate section of the file? I don't know. I think everyone just found that interleaving it within the HTML was not that useful. For the legacy reasons discussed earlier, HTML is a mess. It's difficult to parse. It's overloaded with a lot of stuff. JSON is the more modern thing. It seems reasonable to me that we ended up with this implementation. Note that Google does have some level of support for other semantic data, like RDFa, which I think goes directly in the HTML - it is not popular.
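For anyone who hasn't seen one, a minimal structured-data block of the kind Google documents looks like this (all values here are placeholders):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example headline",
  "datePublished": "2020-01-01",
  "author": { "@type": "Person", "name": "Jane Doe" }
}
</script>
```

It sits in the page's `<head>` or `<body>`, completely separate from the visible markup, and a crawler can pull the facts out with a plain JSON parser.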
Which brings us to the third interesting thing: the JSON-LD schemas Google uses are standards, or at least... standard-y. The W3C is involved. Google, Yahoo, Yandex, and Microsoft have made the largest contributions, to my knowledge. You can read all about it on schema.org.
TL;DR - XHTML was not a practical technology and no browser or tool vendor wanted to support it. We eventually got the semantic web anyway!
As someone who worked in the field of "semantic XML processing" at the time, I can tell you that while the "XML processing" part was (though full of unnecessary complications) well understood, the "semantic" part was purely aspirational and never well understood. The common theme with the current flurry of LLMs and their noisy proponents is that in both cases it is possible to do worthwhile and impressive demos with these technologies, and also build real applications that do useful things, but people who have their feet on the ground know that XML doesn't engender "semantics" and LLMs are not "conscious". Yet the hype peddlers keep the fire burning by suggesting that if you just do "more XML" and build bigger LLMs, then at some point real semantics and actual consciousness will somehow emerge like a chick hatching from an egg. And, being emergent properties, who is to say semantics and consciousness will not emerge, at some point, somehow? A "heap" of grains is emergent, after all, and so is the "wetness" of water. But I have strong doubts about XHTML being more semantic than HTML5.
And anyway, even if Google had nefarious intentions, and even if they managed to steer the standardization, one also has to concede that all search engines before Google were encumbered by too much structure and too-rigid approaches. When you were looking for a book in a computerized library at that point, it was standard to be sat in front of a search form with many, many fields: one for the author's name, one for the title, and so forth. Searching was not only a pain, it was also very hard for a user without prior training. Google demonstrated it could deliver far better results with a single short form field filled out by naive users who just plonked down the three or five words on their mind, et voilà. They made it plausible that instead of imposing a structure onto data at creation time, maybe it's more effective to discover associations in the data at search time (well, at indexing time, really).
As for the strictness of documents, I'm not sure what it would give us that we don't get with sloppy documents. OK, web browsers could refuse to display a web page if any one image tag is missing the required `alt` attribute. So what happens then - will web authors duly include alt="picture of a cat" for each picture of a cat? Maybe, to a degree, but the other 80% of alt attributes will just contain some useless drivel to appease the browser. I'm actually more in favor of strict documents than I used to be, but on the other hand we (I mean web browsers) have become quite good at reconstructing usable HTML documents from less-than-perfect sources, and the reconstructed source is also a strictly validating source. So I doubt this is the missing piece; I think the semantic web failed because the idea never was strong, clear, compelling, well-defined, and rewarding enough to catch on with enough people.
If we're honest, we still don't know, 25 years later, what 'semantic' means after all.
As a programmer, I really liked XHTML because it meant I could use a regular XML parser/writer to work with it. Such components can be made small and efficient if you don't need the more advanced features of XML (e.g. schemas) - on the level of a JSON parser. I remember an app I wrote that had a "print" feature that worked by generating an HTML document. We made it XHTML and used the XML library we already used elsewhere to generate the document. Much more reliable than concatenating strings (hello injections!) and no need for an additional dependency.
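A sketch of that pattern, assuming Python's stdlib xml.etree (the original app's language and library aren't stated, so the names here are illustrative):

```python
import xml.etree.ElementTree as ET

def render_report(title, user_text):
    # Build the XHTML document as a tree instead of concatenating
    # strings; the serializer escapes user input automatically,
    # so hostile text can't inject markup.
    html = ET.Element("html", xmlns="http://www.w3.org/1999/xhtml")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = title
    ET.SubElement(body, "p").text = user_text
    return ET.tostring(html, encoding="unicode")

out = render_report("Print view", '<script>alert("x")</script>')
print(out)  # the <script> payload comes out escaped as &lt;script&gt;...
```

The same idea works with any XML writer in any language, which was the point: XHTML made HTML generation just another XML job.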
In addition, we used XSLT quite a bit too. It is nice being able to open your XML data files in a web browser and have them nicely formatted without any external software. All you needed was a link to the style sheet.
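For reference, that browser-side hookup is just a processing instruction at the top of the XML file (`books.xsl` here is a hypothetical stylesheet name):

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="books.xsl"?>
<books>
  <book><title>Example</title></book>
</books>
```

Open that file in a browser and it fetches the stylesheet and renders the transformed output - no server-side templating, no JavaScript.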