Well, I am running the stack in production right now, but everyone has a different understanding of what that actually means...
Do you have concrete examples of these catastrophic failures? I've personally havent experienced any myself during these years, but I'm doing very boring and typical stuff, so wouldn't surprise me there was hard edges still.
There's a difficult distinction here, you're right.
Technically even a single server running LAMP as root but taking frontend traffic meets the definition of in production but I think we all recognise that it's not the right idea.
What I'm referring to is: should the disk start to have issues: what does prometheus do? If the scrapers start to stall due to connection timeouts: what does prometheus do? If you are doing linear interpolation of data and you have massive gaps because you're polling opportunisitically: what does prometheus do.
I'm all about boring technology, but prometheus assumes too much happy path. It assumes that a single node is enough for time series data that is used for alerting.
Which, it is: at very small scale and with best effort reliability.
It's not acceptable as soon as lost data could be critically important in diagnosing major issues in billing systems, or actually billing users, or to infer issues that need to be correlated across multiple systems.