Hacker News

What we learned from a 22-Day storage bug (and how we fixed it)

36 points by mmcclure, last Monday at 4:34 PM | 7 comments

Comments

altairprime, today at 2:29 PM

> During this incident, we discovered we had crossed a scale threshold where our log ingestion pipeline was being rate-limited and quietly discarding logs. Ironically, we ended up with less information as a result, which made it significantly harder to reconstruct what was actually happening.

Last year they posted about using New Relic, Datadog, and Grafana. Would this ‘silent deletion of log data due to quota’ problem be characteristic of any one of them in particular, or is it something we have to watch out for with all of them?

mannyv, today at 5:24 PM

Why bother transcoding on the fly? Storage is cheaper than CPU, and the work it takes to determine what needs encoding is excessive.

It implies that you guys are generating the playlists on the fly, tracking the client requests, then feeding that over to your transcoder - which then needs to get the original, seek, and transcode. Why bother?

pooplord69, today at 3:05 PM

“We didn’t handle errors, didn’t have logs, and now we do cuz next time” saved you a few mins
