
amluto, today at 5:06 AM

I found this article to be a very poor explanation of what (I think) it’s trying to say.

I think the point the article ought to be making is much better handled entirely separately for request time and for outages. For outages, it goes something like this: if you have a 1 hour outage, and your user notices that outage, they think you had a 1 hour outage. [0] If you do statistics that observe that you also had ten thousand 1 second outages and thus had an MTTR of under two seconds, this does not excuse your 1 hour outage in the slightest. And the longer an outage is, the more likely that any given user interacts with your service during the outage.
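To make the arithmetic concrete, here is a quick sketch using the numbers from the paragraph above: one 1 hour outage plus ten thousand 1 second outages. The variable names are mine, and the duration-weighted mean at the end is my reading of what the article's weighting amounts to, not something taken from the article.

```python
# Outage durations in seconds: one 1-hour outage plus ten thousand 1-second outages.
durations = [3600] + [1] * 10_000

total_downtime = sum(durations)

# Plain MTTR: every outage counts equally, however long it lasted.
mttr = total_downtime / len(durations)

# Fraction of all downtime that belongs to the 1-hour outage. A user who hits
# the service at a uniformly random moment of downtime lands in the long
# outage with this probability.
share_in_long_outage = 3600 / total_downtime

# Duration-weighted mean outage length: each outage is weighted by how long it
# lasted, i.e. by how likely a randomly timed request is to land in it.
weighted_mean = sum(d * d for d in durations) / total_downtime

print(f"MTTR: {mttr:.2f} s")
print(f"Share of downtime in the long outage: {share_in_long_outage:.1%}")
print(f"Duration-weighted mean outage length: {weighted_mean:.0f} s")
```

The MTTR comes out under two seconds, while more than a quarter of all downtime sits in the single long outage.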

But the article is oddly caught up in this t-weighting idea, without justification. What does the statement that “Alex and Alice experience E_a[X]” even mean? What’s X? Is it the distribution of request times? If so, then I don’t see the article’s point: if I, a user, sample a bunch of requests, I recover an approximation of X, not X^2. And I really hope that X is not intended to be the distribution of outage lengths, because I think the conclusion is just wrong, as I alluded to above. Sure, if I happen to sample your service during an outage, the probability that I sample any specific outage is proportional to the length of that outage, but what about all the times that you are (hopefully) not having an outage? What if you have two consecutive outages that are so close to each other in time that I don’t think you recovered?

It would be entertaining to make an outage website. I’d pick a distribution over outage lengths. At time 0 I would sample that distribution, get an outage length t, and declare myself down until time t. All requests during that “outage” would report “hey, I’m down, and my outage length is t”. After the “outage” the site would sample again and repeat. This would give the answer in the article. But this is, of course, absurd.
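A rough simulation of that hypothetical site, as a sketch only. The exponential distribution, the function name `simulate`, and all parameters are my own assumptions, and "requests" are modeled as uniformly random points on the timeline. The site is down 100% of the time, with outages laid back to back.

```python
import bisect
import itertools
import random


def simulate(num_outages=100_000, num_requests=100_000, mean_length=1.0, seed=0):
    """Return (plain mean outage length, mean outage length reported to requests)."""
    rng = random.Random(seed)

    # Sample outage lengths; exponential is an arbitrary choice for illustration.
    lengths = [rng.expovariate(1.0 / mean_length) for _ in range(num_outages)]

    # Outages are back to back, so outage i covers [ends[i-1], ends[i]).
    ends = list(itertools.accumulate(lengths))
    total_time = ends[-1]

    reported = []
    for _ in range(num_requests):
        t = rng.uniform(0.0, total_time)
        # Find the outage in progress at time t; clamp in case t == total_time.
        i = min(bisect.bisect_right(ends, t), num_outages - 1)
        # The site answers "I'm down, and my outage length is lengths[i]".
        reported.append(lengths[i])

    return sum(lengths) / num_outages, sum(reported) / num_requests


plain_mean, experienced_mean = simulate()
print(f"mean outage length over outages:  {plain_mean:.2f}")
print(f"mean outage length seen by users: {experienced_mean:.2f}")
```

For an exponential distribution with mean 1, the length-weighted mean is E[X^2]/E[X] = 2, so the second number should come out near twice the first. That only holds because this toy site is never up and every request is told the full length of the outage in progress.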

[0] To be pedantic, they may not notice the beginning of the outage. This is a constant factor correction.