> A dev might think you can just catch and log the exception. Doesn't fix it.
You've just succinctly made the argument against checked exceptions FWIW (which I agree with you on). Anyone who has used Java in anger (is there any other way?) will be familiar with:
try {
    doSomething();
} catch (CheckedException e) {
    logger.error("Didn't work", e);
}
Fault tolerance in general is terrible in most software. One of my biggest bugbears is network latency and transient failures in network requests that would be solved with a simple retry. But no, there's an incredibly lazy "Request failed" dialog to the user. That's the equivalent of the "log and silently swallow" pattern above. It can get a lot worse than that too. I have an app on my phone that will log me out and force me into a 2FA cycle if it hits a network timeout. Like.... WHY?!?!?! Anyway, I digress...
This is largely a software issue. Control systems are built to handle these kinds of things. A traffic light can't accidentally show green in two directions. It's literally wired for that to be impossible because it's simply too important to leave to chance. You constantly have to deal with faulty sensors, so you have systems that will seek a consensus from 3+ sensors and, if that fails, the system stays in a failed state until you fix it.
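To make the sensor consensus idea concrete, here is a minimal sketch in Java. It is not how any particular control system does it; the class name `SensorVoter`, the tolerance parameter, and the "strict majority around the middle reading" rule are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.OptionalDouble;

/** Hypothetical sketch: majority agreement among redundant sensor readings. */
final class SensorVoter {
    private SensorVoter() {}

    /**
     * Returns the middle reading if a strict majority of sensors agree with it
     * within the given tolerance; otherwise returns empty so the caller can
     * fail safe instead of guessing.
     */
    static OptionalDouble consensus(double[] readings, double tolerance) {
        if (readings.length < 3) {
            return OptionalDouble.empty(); // not enough redundancy to vote
        }
        double[] sorted = readings.clone();
        Arrays.sort(sorted);
        double middle = sorted[sorted.length / 2];
        int agreeing = 0;
        for (double reading : sorted) {
            if (Math.abs(reading - middle) <= tolerance) {
                agreeing++;
            }
        }
        // Require more than half of the sensors to agree with the middle value.
        return agreeing * 2 > sorted.length
                ? OptionalDouble.of(middle)
                : OptionalDouble.empty();
    }
}
```

The point of returning an empty value rather than a best guess is that "no consensus" becomes a state the caller has to handle explicitly.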
But in software the standards just seem to be much lower even though it can be critical, even lethal, e.g. [1]. Network interfaces should be fuzzed. Every IO operation should assume it can fail and be tested for when it does. Every IO operation should also be assumed capable of producing unexpected output, and be tested for that too. And it's simply cost-cutting and a lack of regulation that allows this sloppiness to persist. There should certainly be strict liability for any companies that allow this to happen.
[1]: https://ethicsunwrapped.utexas.edu/case-study/therac-25
> It can get a lot worse than that too. I have an app on my phone that will log me out and force me into a 2FA cycle if it hits a network timeout.
I use some fairly popular (in the MSP space) backup software that thinks the network is infallible. The worst case I’ve seen is when it fails on a network request, doesn’t retry adequately, and incorrectly logs the error as data corruption.
IMO a lot of these problems come down to the same root cause: we are not fully enumerating and reasoning about failure cases.
Let's say you want to retry a network request. It's... A bit more complex than it seems, right?
Firstly, you need to know exactly what type of error you ran into. Some errors aren't really recoverable. Maybe a programming issue occurred and you are constructing an invalid URL and the HTTP client is yelling at you. No sense in retrying that 20 times. Maybe it's a network error; that seems like a good candidate to retry. Maybe the request succeeded and we have a response, but it is a 500 error; again, that seems like a good candidate.
Secondly, you need to know if it is safe to retry. If the request is essentially idempotent, like a read-only GET request, then surely it is safe, right? But, what if it isn't safe? Forget about solutions like idempotency tokens; let's assume you don't control that. Now you need to figure out how you can know if the request had side effects. If a well-known 4xx error is returned you might know, but if you get a network error or a 5xx error it's much harder. Did the request fail during a buffered response after the side effects were already applied? Maybe you can check to see if the request applied with another request. Now you have two network requests, and both need error handling.
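The first two questions (what kind of failure was it, and is a retry safe) can be sketched as a small decision function. This is only an illustration under stated assumptions: `RetryPolicy`, `Decision`, and the convention of passing `-1` for "no response" are hypothetical names, every `IOException` is assumed to be transient, every 5xx is assumed to be retryable, and the server is assumed to actually honor the idempotency that HTTP semantics assign to each method.

```java
import java.io.IOException;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: decide whether a failed HTTP call may be retried. */
final class RetryPolicy {
    enum Decision { RETRY, DO_NOT_RETRY, VERIFY_FIRST }

    // Methods that HTTP semantics define as idempotent.
    private static final Set<String> IDEMPOTENT =
            Set.of("GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE");

    private RetryPolicy() {}

    /**
     * @param method HTTP method of the failed request
     * @param status response status, or -1 if no response was received
     * @param error  exception thrown by the client, or null if a response arrived
     */
    static Decision decide(String method, int status, Throwable error) {
        boolean transientFailure;
        if (error != null) {
            // Assumption: I/O errors are transient; anything else (for example an
            // IllegalArgumentException from a malformed URL) is a programming bug.
            transientFailure = error instanceof IOException;
        } else {
            // Assumption: any 5xx is worth retrying; 4xx means the request is wrong.
            transientFailure = status >= 500 && status <= 599;
        }
        if (!transientFailure) {
            return Decision.DO_NOT_RETRY;
        }
        // A transient failure on a non-idempotent request may already have had
        // side effects, so the caller has to check before sending it again.
        return IDEMPOTENT.contains(method.toUpperCase(Locale.ROOT))
                ? Decision.RETRY
                : Decision.VERIFY_FIRST;
    }
}
```

Note that `VERIFY_FIRST` is exactly the uncomfortable case described above: the follow-up "did it apply?" request is itself a network call that needs its own error handling.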
Finally, and probably most obviously, you have to make sure you don't hammer the server when it is under load. To avoid the thundering herd problem, you'll probably want to use an exponential backoff with some jitter.
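For reference, a minimal sketch of capped exponential backoff with jitter. The variant shown picks a uniformly random delay between zero and the exponential ceiling (often called "full jitter"); the class name `Backoff` and the base/cap parameters are assumptions for illustration, and non-negative inputs are assumed.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch: capped exponential backoff with full jitter. */
final class Backoff {
    private Backoff() {}

    /** Upper bound for the delay before retry number {@code attempt} (0-based). */
    static long ceilingMillis(int attempt, long baseMillis, long capMillis) {
        // Clamp the shift so that realistic base delays cannot overflow a long.
        int shift = Math.min(Math.max(attempt, 0), 30);
        return Math.min(capMillis, baseMillis * (1L << shift));
    }

    /** Picks a random delay in [0, ceiling] so clients do not retry in lockstep. */
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long ceiling = ceilingMillis(attempt, baseMillis, capMillis);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

The randomness is what addresses the thundering herd: without it, every client that failed at the same moment would retry at the same moment too.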
What sucks about all of this is that while there are reusable components here, the concerns effortlessly cut through different layers, making them a pain in the ass to deal with. It isn't that it is impossible for a library to handle all of these problems (I anticipate an excited evangelist may reply explaining how their favorite library does it all in one package if this post gets enough visibility) it's just that this is hard and these problems repeat in different forms, in a way that makes it difficult to fully eliminate the repetition. And this is just the most obvious basics, whereas in reality there are almost always case-specific complexities.
You can, for example, encapsulate a reasonable exponential backoff with deadline implementation and apply that as appropriate for different things, but you can't really cheat your way out of having to think about all of these things, especially if you don't control all of the network APIs you might have to interface with.
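As a sketch of what such an encapsulated piece might look like, the helper below retries an operation with jittered exponential backoff until an overall deadline would be exceeded. Everything here is hypothetical (`DeadlineRetry`, its parameters, the choice to rethrow the last failure), and it deliberately leaves the hard part to the caller: the `retryable` predicate is where the per-API thinking still has to happen.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;

/** Hypothetical sketch: retry with jittered exponential backoff and a deadline. */
final class DeadlineRetry {
    private DeadlineRetry() {}

    static <T> T call(Callable<T> operation,
                      Predicate<Exception> retryable,
                      Duration deadline,
                      long baseMillis,
                      long capMillis) throws Exception {
        long deadlineNanos = System.nanoTime() + deadline.toNanos();
        for (int attempt = 0; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception failure) {
                if (!retryable.test(failure)) {
                    throw failure; // the caller said this kind of error is not worth retrying
                }
                // Exponential ceiling, capped, with a uniformly random delay below it.
                long ceiling = Math.min(capMillis, baseMillis << Math.min(attempt, 20));
                long sleepMillis = ThreadLocalRandom.current().nextLong(ceiling + 1);
                long remainingMillis = (deadlineNanos - System.nanoTime()) / 1_000_000;
                if (sleepMillis >= remainingMillis) {
                    throw failure; // waiting would run past the deadline, so give up
                }
                Thread.sleep(sleepMillis);
            }
        }
    }
}
```

A call site might look like `DeadlineRetry.call(() -> fetch(url), e -> e instanceof IOException, Duration.ofSeconds(10), 100, 2_000)`, where `fetch` stands in for whatever request you are making.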
This is one part of why I don't like try/catch exceptions. They are an appropriate mechanism to use as a failure isolation boundary due to their stack unwinding capability: it would still be bad in most cases if a logic error or upstream error not being handled properly in a single network request handler were able to crash an entire network server, so being able to blanket catch everything that bubbles up and log it is good. But using this for normal error handling makes doing the wrong thing perhaps just a bit too easy. I don't think you should have to self-flagellate in order to say "just crash if this errors", but I do think that you should have to say it. Try/catch exceptions are backwards by default: just write normal-looking control flow and no errors are handled, and it's hard to tell if there even are any. Checked exceptions try to fix this but somehow this feels even worse; now you have a flattened list of exceptions that may occur at various different layers of depth, in some cases the same exception can occur at different layers of depth, and you may literally need to read source code and map out the call stack in your head to be sure. (Hope it doesn't change later.)
The Result or Expected type concept seems like the way to go in the frame of modern programming languages. Go's error passing also works OK, though it has papercuts (that a linter can help you with, at least). To me it makes more sense to make stack-unwinding error handling a more niche feature used for isolating error domains, rather than use it for all error handling.
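Since the thread started with Java, here is a rough sketch of what a Result type could look like there. It is not a standard library type; `Result`, `fold`, and `PortParser` are names made up for this example, and it assumes Java 17 or later for sealed interfaces and records.

```java
import java.util.function.Function;

/** Hypothetical sketch of a Result type; assumes Java 17+ (sealed interfaces, records). */
sealed interface Result<T, E> permits Result.Ok, Result.Err {

    /** Forces the caller to say what happens in both the success and the failure case. */
    <R> R fold(Function<? super T, ? extends R> onOk, Function<? super E, ? extends R> onErr);

    record Ok<T, E>(T value) implements Result<T, E> {
        @Override
        public <R> R fold(Function<? super T, ? extends R> onOk,
                          Function<? super E, ? extends R> onErr) {
            return onOk.apply(value);
        }
    }

    record Err<T, E>(E error) implements Result<T, E> {
        @Override
        public <R> R fold(Function<? super T, ? extends R> onOk,
                          Function<? super E, ? extends R> onErr) {
            return onErr.apply(error);
        }
    }
}

final class PortParser {
    private PortParser() {}

    /** The possible failure is part of the return type instead of an unwinding path. */
    static Result<Integer, String> parsePort(String text) {
        try {
            int port = Integer.parseInt(text.trim());
            if (port >= 1 && port <= 65535) {
                return new Result.Ok<>(port);
            }
            return new Result.Err<>("port out of range: " + port);
        } catch (NumberFormatException e) {
            // The exception is converted to a value right at the boundary.
            return new Result.Err<>("not a number: " + text);
        }
    }
}
```

A caller then writes something like `PortParser.parsePort(input).fold(p -> "port " + p, err -> "error: " + err)`, and there is no way to get at the value without also supplying the error branch.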
But even that! Even that doesn't solve the problem. You still have to sit there and think about the types of errors that can occur and their consequences. At best, explicit error handling with value types just encourages you to confront it and makes it visible, even in cases where you still say "OK, pass to caller".
As much as I agree with the spirit of your post, standards did in fact change after the Therac-25 incident. That was roughly 40 years ago, after all (the accidents happened between 1985 and 1987)! There are very high quality bars for medical equipment.