This is the same calculation behind the observation "you spend much longer in front of red traffic lights than green ones".
It's an interesting observation, but it's playing games with the meaning of "mean latency" and I'm not sure this is a very helpful way to look at requests to a web service - slightly slowing the fastest responses to requests would improve your time-weighted mean latency.
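A quick sketch of that effect, assuming the time-weighted mean is computed as sum(x^2) / sum(x). The latencies are made up for illustration, not measurements from any real service.

```python
def time_weighted_mean(latencies):
    """Mean latency seen by an observer sampling a random instant of waiting time.

    Assumed definition: each request is weighted by its own duration,
    which gives sum(x^2) / sum(x).
    """
    return sum(x * x for x in latencies) / sum(latencies)


# Hypothetical latencies in ms: nine fast requests and one slow one.
before = [10] * 9 + [1000]
# Same service, but the fast requests are slowed from 10 ms to 20 ms.
after = [20] * 9 + [1000]

plain_before = sum(before) / len(before)
plain_after = sum(after) / len(after)
tw_before = time_weighted_mean(before)
tw_after = time_weighted_mean(after)

print(f"plain mean:         {plain_before:.1f} -> {plain_after:.1f} ms")
print(f"time-weighted mean: {tw_before:.1f} -> {tw_after:.1f} ms")
```

With these numbers the plain mean gets worse (109 ms to 118 ms) while the time-weighted mean "improves" (about 918 ms to about 851 ms), even though no request got faster.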
It's a better metric for looking at outages - instantaneous outages don't affect anyone, and time-weighting correctly discards them. On the other hand, average outage length is a very suspect and gameable metric unless accompanied by uptime %.
Considering metrics other than p99 for user impact is unwise. Every user will at some point experience a <1% request. It's not as if half of all users only send requests that come in under your median latency; some of their requests will hit your worst case.
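To put a rough number on that, here is a small sketch. It assumes each request independently lands in the slowest 1% with probability 0.01, which is a simplification (real tail latency tends to be correlated across requests).

```python
def prob_hits_tail(num_requests, tail_fraction=0.01):
    """Chance that at least one of num_requests falls in the slow tail.

    Assumes requests are independent, which is an idealisation.
    """
    return 1.0 - (1.0 - tail_fraction) ** num_requests


for n in (1, 10, 100, 500):
    print(f"{n:>4} requests: {prob_hits_tail(n):.1%} chance of at least one >p99 request")
```

Under that assumption a user who sends 100 requests has roughly a 63% chance of seeing at least one beyond-p99 response, and it only climbs from there.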
By focusing on the tail and optimizing worst cases you help users more than by improving your median latency.
> Alice says your service is slow. You tell Alice that the mean request to your service completes in 100ms, but Alice says that her mean wait time is 1s.
There are also plenty of situations where a service can have a bimodal performance distribution and the impact of that can fall on certain users disproportionately.
Imagine a retail website that serves images from a global CDN, with cache misses pulled from a server in the EU. Users who visit our homepage, or look at our bestselling products, get a cache hit from the CDN node close to them, in 50ms. But users who look at our long-tail products get a cache miss - and if they're not near Europe, they'll get a noticeable delay.
Hence our mean image load time is 100ms - but a customer browsing an obscure product category for their location can experience markedly worse performance. If Alice is the only person in Costa Rica looking at ski equipment in June, she's going to get a lot of cache misses.
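A worked version of that arithmetic, as a sketch. Only the 50ms hit time and the 100ms overall mean come from the example above; the 550ms miss latency, the 90% global hit rate and Alice's 20% hit rate are numbers I picked so the totals line up.

```python
HIT_MS = 50.0    # cache hit latency, from the example above
MISS_MS = 550.0  # cache miss latency far from the EU origin (assumed)


def mean_load_time(hit_rate):
    """Mean image load time for a given cache hit rate."""
    return hit_rate * HIT_MS + (1.0 - hit_rate) * MISS_MS


overall_mean = mean_load_time(0.90)  # assumed global hit rate
alice_mean = mean_load_time(0.20)    # assumed hit rate for Alice's long-tail browsing

print(f"overall mean: {overall_mean:.0f} ms")
print(f"Alice's mean: {alice_mean:.0f} ms")
```

The service-wide mean comes out at 100ms while Alice's own mean is 450ms, and nothing in the aggregate number hints at that.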
> More technically, what’s going on here is the inspection paradox. Alex and Alice don’t experience your latency distribution, they experience a t-weighted version of it
Ooh I got pushed in the 2m end of the pool there. What is the intuition? The ten hundred most popular words sort of thing.
I am very interested in this article though. At first I assumed it would be about TTFB vs. time to render the page after all those async useEffects have run, but it isn't that. This is something else, and I am very interested.
I don’t remember any service I used in the last couple of years, where I thought to myself: this service is really fast and responsive. Great experience.
Quite the contrary: feels like everything got worse. Sometimes painfully slow, buggy and unreliable.
This article contains very little substance. Show me the math!
Thank you for writing this article, there's a deep and powerful insight illustrated here: An observer using the system experiences different statistics from the system operator. By extension, taking an average of observer experiences leads to different conclusions from taking an average of system performance. One must not confuse the two when designing systems.
There is a branch of math dedicated to (among other things) truthfully estimating the waiting time, called queueing theory. I wonder why it wasn't mentioned in the article.
This feels similar to how nuclear power is perceived, contrasting deaths per TWh and the long tail effect of a rare but serious accident.
I found this article to be a very poor explanation of what (I think) it’s trying to say.
I think the point the article ought to be making is much better handled entirely separately for request time and for outages. For outages, it goes something like this: if you have a 1 hour outage, and your user notices that outage, they think you had a 1 hour outage. [0] If you do statistics that observe that you also had ten thousand 1 second outages and thus had an MTTR of under two seconds, this does not excuse your 1 hour outage in the slightest. And the longer an outage is, the more likely that any given user interacts with your service during the outage.
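The numbers in that outage example are easy to check. A minimal sketch, using only the figures above (one 1 hour outage plus ten thousand 1 second outages); the duration-weighted mean is my addition for comparison.

```python
# Outage durations in seconds: one 1-hour outage and ten thousand 1-second outages.
outages = [3600.0] + [1.0] * 10_000

# Plain MTTR: every outage counts equally.
mttr = sum(outages) / len(outages)

# Duration-weighted mean: every second of downtime counts equally.
weighted_mean = sum(t * t for t in outages) / sum(outages)

# Fraction of all downtime that belongs to the single long outage.
long_outage_share = 3600.0 / sum(outages)

print(f"MTTR:                     {mttr:.2f} s")
print(f"duration-weighted mean:   {weighted_mean:.0f} s")
print(f"downtime in the 1h event: {long_outage_share:.0%}")
```

MTTR comes out around 1.36 seconds, while the duration-weighted mean is roughly 954 seconds, and about a quarter of all downtime sits in the one long outage.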
But the article is oddly caught up in this t-weighting idea, without justification. What does the statement that “Alex and Alice experience E_a[X]” even mean? What’s X? Is it the distribution of request times? If so, then I don’t see the article’s point: if I, a user, sample a bunch of requests, I recover an approximation of X, not X^2. And I really hope that X is not intended to be the distribution of outage lengths, because I think the conclusion is just wrong, as I alluded to above. Sure, if I happen to sample your service during an outage, the probability that I sample any specific outage is proportional to the length of that outage, but what about all the times that you are (hopefully) not having an outage? What if you have two consecutive outages that are so close to each other in time that I don’t think you recovered?
It would be entertaining to make an outage website. I’d pick a distribution over outage lengths. At time 0 I would sample that distribution, get an outage length t, and declare myself down until time t. All requests during that “outage” would report “hey, I’m down, and my outage length is t”. After the “outage” the site would sample again and repeat. This would give the answer in the article. But this is, of course, absurd.
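That thought experiment is small enough to simulate. A sketch, assuming outage lengths are exponential with mean 1 (my choice, any distribution would do) and that probes arrive at uniformly random instants:

```python
import bisect
import random


def simulate_outage_site(num_outages=100_000, num_probes=100_000, seed=0):
    """Back-to-back 'outages'; each probe reports the length of the outage it lands in.

    Returns (plain mean outage length, mean length reported to probes).
    """
    rng = random.Random(seed)
    # Assumed distribution: exponential with mean 1.
    lengths = [rng.expovariate(1.0) for _ in range(num_outages)]

    # ends[i] is the time at which outage i finishes.
    ends = []
    total = 0.0
    for t in lengths:
        total += t
        ends.append(total)

    reported = []
    for _ in range(num_probes):
        instant = rng.uniform(0.0, total)
        # Index of the outage covering this instant (clamped for the endpoint case).
        idx = min(bisect.bisect_right(ends, instant), num_outages - 1)
        reported.append(lengths[idx])

    plain_mean = sum(lengths) / num_outages
    observed_mean = sum(reported) / num_probes
    return plain_mean, observed_mean


plain, observed = simulate_outage_site()
print(f"mean outage length:         {plain:.3f}")
print(f"mean length seen by probes: {observed:.3f}")
```

For an exponential distribution the probes should report about twice the plain mean, which is the length-biased (inspection paradox) answer. It only works out that way because this site is never up.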
[0] To be pedantic, they may not notice the beginning of the outage. This is a constant factor correction.
I've grown to dislike the typical tail measurements completely. What I usually look at these days is what share of unique users experience an "unacceptable experience" over a measurement period instead.
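A minimal sketch of that kind of measurement, assuming a request log of (user_id, latency_ms) pairs and a single latency threshold for "unacceptable". The log format, the threshold and the sample data are all hypothetical.

```python
def share_of_users_affected(requests, threshold_ms):
    """Fraction of unique users with at least one request slower than threshold_ms."""
    all_users = set()
    affected = set()
    for user_id, latency_ms in requests:
        all_users.add(user_id)
        if latency_ms > threshold_ms:
            affected.add(user_id)
    return len(affected) / len(all_users) if all_users else 0.0


# Hypothetical period: 10 users, 100 requests each, 5 users hit one 2 s request.
log = [
    (user, 2000.0 if (i == 0 and user < 5) else 100.0)
    for user in range(10)
    for i in range(100)
]

slow_request_share = sum(1 for _, ms in log if ms > 1000.0) / len(log)
affected_user_share = share_of_users_affected(log, threshold_ms=1000.0)

print(f"slow requests:  {slow_request_share:.1%}")
print(f"affected users: {affected_user_share:.0%}")
```

In this made-up log only 0.5% of requests are slow, yet half the users had a bad experience during the period.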
I find it much more inquisitive and visceral, to the extent that p99 now boggles my mind. 2N would be dreadful as an availability figure, yet for UX it's treated very differently. So much so that my measurements corroborate exactly that; good UX requires the same many-nines reliability as e.g. DCs, not one or two.
I wonder if it's p90 and p99 to blame for the shoddy services we have, in a way. It's pretty hard to argue for fixing something when it's presented as only going wrong 0.5% or less of the time after all. Even if at scale that means most of your users are experiencing it weekly.
Is the formula for E_a[X] trivial? I don't see it immediately...
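If E_a[X] is the mean of the t-weighted distribution quoted from the article (that is my reading of the notation, not something I can confirm), the usual length-biased derivation goes like this, with f the density of X and f_a the density an observer at a random instant sees:

```latex
\begin{align*}
f_a(t) &= \frac{t\, f(t)}{\int_0^\infty s\, f(s)\, ds} = \frac{t\, f(t)}{E[X]} \\
E_a[X] &= \int_0^\infty t\, f_a(t)\, dt
        = \frac{\int_0^\infty t^2 f(t)\, dt}{E[X]}
        = \frac{E[X^2]}{E[X]}
        = E[X] + \frac{\operatorname{Var}[X]}{E[X]}
\end{align*}
```

So under that reading it is always at least E[X], and the gap grows with the variance.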
Interesting you work at Amazon and show how end user experience weights to their pessimal experience.
So.. apply that to Amazon design heuristics like author name search on books, where Amazon returns "in the style of" and "not a book but this guy called Charles Dickens makes jigsaws" as high order matches, and consider how the end user experience weights to the pessimal, yet Amazon can show that on average they make more money doing this..
(Understood that engineers and AWS don't influence UX in the storefront or search)
My understanding is a lot of the probability puzzles in Allen Downey's Probably Overthinking It [1] also boil down to similar selection effects (the inspection paradox is definitely in there). There is a lot of cool stuff in that book (and his blog of the same name).
[1] https://greenteapress.com/wp/probably-overthinking-it/