logoalt Hacker News

__turbobrew__today at 2:36 AM0 repliesview on HN

I think it depends a lot on the operational maturity of the company. Some places are running the LGTM observability stack, sentry for error reporting, 24/7 on call rotations, playbooks for all alerts, etc. Those organizations will have less issues running systems like temporal because the operational framework is already there.

Other orgs have never heard of alerts or error reporting and naturally will not catch issues until they are catastrophic (for example services that crash frequently in the background go unnoticed until the crash frequency causes a catastrophic failure). In my experience a lot of issues are pretty simple such as running out of memory, CPU throttling, crashes caused by simple bugs (nil panics). If you have good observability you can catch those issues early.

For example: people rag on Ceph that their cluster somehow got into a broken state, but that really only occurs when abuse of the ceph cluster has went on long enough that the cluster finally reaches the tipping point where it is unrecoverable. If you set ceph up, follow the correct replication rules so components are spread across failure domains, and use the metrics and alerts that are distributed with ceph it is actually quite hard to break the cluster.