logoalt Hacker News

jedimastertyesterday at 7:05 PM10 repliesview on HN

You know I was actually really curious about this so I went back to the HTML and URL W3C standards and surprisingly they don't actually have any definitions of format other than being percent encoded. One might conflate query strings with "form-urlencoded"[0] query strings, which is one potential interoperability format, but in general a queries string is just any percent encoded string following a "?" in a url[1], and just another property in the "URL" HTML object that can be used in the generation of a response. While additionally there is a URLSearchParams object that is the result of parsing the query string with the form-urlencoded parser, this is simply an interoperability layer for JavaScript.

I'm going to be honest, I was pretty geared up to have a contrarian opinion until I looked at the standards but they're actually pretty clear, a 404 could be a proper response to unexpected query string; query string is as much part of the URL API as the path is and I think pretty much everyone can acknowledge that just tacking random stuff onto the path would be ill advised and undefined behavior.

[0]: https://url.spec.whatwg.org/#application/x-www-form-urlencod...

[1]: https://url.spec.whatwg.org/#url-class


Replies

wongarsuyesterday at 9:00 PM

Back in the day it was reasonably common for CMSs and forums to only have an index.php, and routing entirely by query string (in form-urlencoded form, people were not savages). So you would have index.php?p=home and index.php?p=shop. Or index.php?action=showthread&forum=42&thread=17976. It should be immediately obvious that in that scheme 404 is indeed the correct answer to unknown query parameters

In fact lots of sites still work like that, they just hide it behind a couple rewrite rules in apache/nginx for SEO reasons

show 7 replies
chrismorgantoday at 2:47 AM

Yeah, URLs really don’t have much in the way of semantics. Path is clearly intended for hierarchical data and query for non-hierarchical data, and there are strong customs, some commonly supported or even enforced by libraries, but no actual rules. Ultimately, it’s just a string that the server can decide what to do with.

The really funny thing about this is that, when I was worrying about possible side effects if I responded 404, I somehow completely forgot how much of the web’s history the path has been useless for. Paths have won. No one really starts new things with URLs like /item?id=… any more. Yay!

show 1 reply
dylan604today at 4:24 AM

Wouldn't a generic 400 be better. It's not that the page wasn't found, but you've sent something that was not an accepted request. Fix your request and try again is how I've read it, and that's how I use it in the APIs I provide. I prefer it over 406 since it's not my end that can't process it. If your query string is tacking extra stuff trying to break things or just because your request wasn't crafted per the docs, then it's on you.

show 1 reply
qilleryesterday at 10:48 PM

Interestingly, quite a few places that should treat query strings transparently make a lot of assumptions about their structure. We ran into that when picking a new CDN, some providers didn't handle repeat parameters (?a=1&a=2) correctly.

show 1 reply
nofriendtoday at 1:27 AM

Standards are just commonly accepted behaviour that somebody chose to write down somewhere. There are a great number of commonly accepted behaviours that nobody's ever bothered to encode into a formal standard, but where failure to follow the accepted practice will result in widespread breakage. There are also a great many "standards" that you would be a fool to follow to the letter. In the OP case, the only thing that will break is people trying to visit their site, who will presumably simply press the back button on their browser and go about their day. They can decide for themselves if that is an acceptable casualty. But it isn't definitionally acceptable because no standard says it isn't (nor would is suddenly become unacceptable because a standard said it was...)

socalgal2today at 8:17 AM

The No-Vary-Search (proposal?)

https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...

effectively lets you specify what parts of a query are relevant. So for example

url?a=b&c=d matches url?c=d&a=b in terms of caching

bawolffyesterday at 11:38 PM

> I was pretty geared up to have a contrarian opinion until I looked at the standards but they're actually pretty clear, a 404 could be a proper response to unexpected query string; query string is as much part of the URL API as the path is and I think pretty much everyone can acknowledge that just tacking random stuff onto the path would be ill advised and undefined behavior.

This feels like a technically correct is the best kind of correct situation. Like technically, yeah web servers may respond 404 if they dont understand a query parameter, but in practise that is not how urls are conceptualized normally.

ompogUeyesterday at 11:18 PM

Something I discovered looking back at some old sites: "pages" defined by URL params don't always make it into the Wayback Machine.

nrdsyesterday at 8:57 PM

Wait until you realize that the difference between path and query string is entirely arbitrary and decided by the server. Query strings should never have existed. They are an implementation detail of CGI webservers that leaked all over everything and now smells really bad.

show 6 replies
TZubiritoday at 12:16 AM

Whatwg is for html, try the IEEE http rfcs