ajaystream • today at 6:26 AM • 4 replies

The fail-silent design is the part worth paying attention to. The conventional approach to redundancy is to compare outputs and vote — three systems, majority wins. What NASA did here instead is make each unit responsible for detecting its own faults and shutting up if it can't guarantee correctness. Then the system-level logic just picks the first healthy source from a priority list.

That's a fundamentally different trust model. Voting systems assume every node will always produce output and the system needs to figure out which output is wrong. Fail-silent assumes nodes know when they're compromised and removes them from the decision set entirely. Way simpler consensus at the system level, but it pushes all the complexity into the self-checking pair.

The interesting question someone raised (what if both CPUs in a pair get the same wrong answer?) is the right one. Lockstep on the same die makes correlated faults more likely than independent failures. The FIT numbers are presumably still low enough to be acceptable, but it's the kind of thing that doesn't matter until it does.
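
To make the contrast concrete, here is a minimal Python sketch of the two trust models. It is an illustration only, not NASA's implementation: `SelfCheckingPair`, `first_healthy`, `majority_vote`, and the toy `good`/`bad` lane functions are all hypothetical names, and a "silent" unit is modelled simply as returning `None`.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

Compute = Callable[[int], int]


class SelfCheckingPair:
    """Hypothetical unit: two lanes run the same computation in lockstep."""

    def __init__(self, lane_a: Compute, lane_b: Compute) -> None:
        self.lane_a = lane_a
        self.lane_b = lane_b

    def output(self, x: int) -> Optional[int]:
        a, b = self.lane_a(x), self.lane_b(x)
        # Fail-silent: if the lanes disagree, emit nothing rather than guess.
        return a if a == b else None


def first_healthy(units: Sequence[SelfCheckingPair], x: int) -> Optional[int]:
    """Fail-silent selection: first non-silent unit in priority order wins."""
    for unit in units:
        result = unit.output(x)
        if result is not None:
            return result
    return None  # every unit went silent


def majority_vote(outputs: Sequence[int]) -> Optional[int]:
    """Voting: every node reports; a strict majority decides."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count * 2 > len(outputs) else None


def good(x: int) -> int:
    return x * 2


def bad(x: int) -> int:
    # Stand-in for a faulty lane: off by one.
    return x * 2 + 1


primary = SelfCheckingPair(good, bad)     # lanes disagree, so the unit goes silent
backup = SelfCheckingPair(good, good)     # healthy unit
correlated = SelfCheckingPair(bad, bad)   # both lanes wrong in the same way
```

With these definitions, `first_healthy([primary, backup], 21)` skips the silent primary and returns the backup's `42`, while `first_healthy([correlated, backup], 21)` returns `43`: the lanes agree on the wrong answer, so the pair never goes silent. That second case is the correlated-fault concern in miniature.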

Replies

adrian_b • today at 7:40 AM

This is similar to the difference between using error-correcting codes and using erasure codes combined with error-detecting codes.

The latter choice is frequently simpler and more reliable for preventing data corruption. (An erasure code can be as simple as having multiple copies and using the first good copy.)
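
A rough Python sketch of the "multiple copies, use the first good copy" idea, assuming CRC-32 (via the standard `zlib.crc32`) as the error-detecting code. The helper names `protect`, `check`, and `first_good_copy` are made up for illustration.

```python
import zlib
from typing import Iterable, Optional


def protect(payload: bytes) -> bytes:
    """Append a CRC-32 so corruption can be detected (not corrected)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(record: bytes) -> Optional[bytes]:
    """Return the payload if its CRC matches, else None (treat as erased)."""
    if len(record) < 4:
        return None
    payload, stored = record[:-4], int.from_bytes(record[-4:], "big")
    return payload if zlib.crc32(payload) == stored else None


def first_good_copy(copies: Iterable[bytes]) -> Optional[bytes]:
    """Replication as a trivial erasure code: take the first copy that verifies."""
    for record in copies:
        payload = check(record)
        if payload is not None:
            return payload
    return None  # every copy failed its check


record = protect(b"telemetry")
# Flip one bit in the first byte to simulate corruption of one copy.
corrupted = bytes([record[0] ^ 0x01]) + record[1:]
```

Here `first_good_copy([corrupted, record])` discards the damaged copy and returns `b"telemetry"` from the second one. Nothing is ever repaired; a copy that fails its check is simply dropped, which is the same shape as a fail-silent unit removing itself from the decision set.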

sammy2255 • today at 7:59 AM

Spoken like an LLM.

randomNumber7 • today at 11:27 AM

> make each unit responsible for detecting its own faults and shutting up if it can't guarantee correctness

Does this mean you have to trust the already compromised system?

high_na_euv • today at 8:37 AM

How can you remove a component from the decision set if it is the only component in the whole decision set?
