> Effectively, eight CPUs run the flight software in parallel. The engineering philosophy hinges on a “fail-silent” design. The self-checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds.
> “A faulty computer will fail silent, rather than transmit the ‘wrong answer,’” Uitenbroek explained. This approach simplifies the complex task of the triplex “voting” mechanism that compares results.
>
> Instead of comparing three answers to find a majority, the system uses a priority-ordered source selection algorithm among healthy channels that haven’t failed silent. It picks the output from the first available FCM in the priority list; if that module has gone silent due to a fault, it moves to the second, third, or fourth.
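The priority-ordered selection the article describes is simple enough to sketch. This is an illustrative toy, not the flight code; the FCM names, the `None`-means-silent convention, and the function shape are all assumptions:

```python
def select_source(fcm_outputs, priority=("FCM1", "FCM2", "FCM3", "FCM4")):
    """Return (name, output) of the first healthy FCM in priority order.

    A module that has failed silent contributes no output (None here),
    so it is simply skipped -- no cross-channel comparison is needed.
    """
    for name in priority:
        output = fcm_outputs.get(name)
        if output is not None:
            return name, output
    raise RuntimeError("all flight computer modules have failed silent")

# Example: FCM1 has gone silent, so FCM2's output is selected.
name, cmd = select_source({"FCM1": None, "FCM2": 42.0, "FCM3": 41.9, "FCM4": 42.1})
```

Note how the hard part (deciding an answer is wrong) never appears here; it has been pushed down into each module's self-checking pair.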
One part that seems omitted from the explanation: what happens if both CPUs in a pair, for whatever reason, perform the same erroneous calculation and their outputs match? How would that source be silenced without comparing its results against the other sources?
When I was first starting out as a professional developer 25 years ago doing web development, I had a friend who had retired from NASA and had worked on Apollo.
I asked him “how did you deal with bugs”? He chuckled and said “we didn’t have them”.
The average modern AI-prompting, React-using web developer could not fathom making software that killed people if it failed. We’ve normalized things not working well.
Does anyone have pointers to some real information about this system? CPUs, RAM, storage, the networking, what OS, what language used for the software, etc etc?
I’d love to know how often one of the FCMs has “failed silent”, and where they were in the route and so on too, but it’s probably a little soon for that.
Some related good books I have been studying the past few years or so. The Spark book is written by people who've worked on Cube sats:
* Logical Foundations of Cyber-Physical Systems
* Building High Integrity Applications with SPARK
* Analysable Real-Time Systems: Programmed in Ada
* Control Systems Safety Evaluation and Reliability (William M. Goble)
I am developing a high-integrity control system for a prototype hoist, to be certified for overhead hoisting to the highest safety standards, targeting aerospace, construction, entertainment, and defense.
NASA didn't build this; Lockheed Martin and their subcontractors did. Articles and headlines like this make people think that NASA does a lot more than they actually do. This is like a CEO claiming credit for everything a company does.
I'm curious: In the current moon flyby, how often did some of these fallback methods get active? Was the BFS ever in control at any point? How many bitflips were there during the flight so far?
I sure wish they would talk about the hardware. I spent a few years developing a radiation hardened fault tolerant computer back in the day. Adding redundancy at multiple levels was the usual solution. But there is another clever check on transient errors during process execution that we implemented that didn't involve any redundancy. Doesn't seem like they did anything like that. But can't tell since they don't mention the processor(s) they used.
I did VOS and database performance stuff at Stratus from 1989-95. Stratus was the hardware fault tolerant company. Tandem, our arch rivals, did software fault tolerance. Our architecture was “pair and spare”. Each board had redundant everything and was paired with a second board. Every pin out was compared on every tick. Boards that could not reset called home. The switch from Motorola 68K to Intel was a nightmare for the hardware group because some instructions had unused pins that could float.
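The pair-and-spare idea described above can be modeled in a few lines. This is purely illustrative (the real per-tick comparison happened in hardware at the pin level, and the class names here are invented):

```python
class Board:
    """Toy model of one Stratus-style board with two lockstepped halves."""

    def __init__(self):
        self.healthy = True

    def tick(self, out_a, out_b):
        """Compare the two halves' outputs; go silent on any mismatch."""
        if not self.healthy:
            return None
        if out_a != out_b:
            self.healthy = False  # self-detected fault: fail silent
            return None
        return out_a


def system_output(primary, spare, a1, b1, a2, b2):
    """Take the primary board's output, falling back to the paired spare."""
    out = primary.tick(a1, b1)
    return out if out is not None else spare.tick(a2, b2)
```

The "pair" catches the fault by disagreement within a board; the "spare" provides continuity, the same division of labor as Orion's self-checking pairs plus priority selection.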
"High-performance supercomputers are used for large-scale fault injection, emulating entire flight timelines where catastrophic hardware failures are introduced to see if the software can successfully ‘fail silent’ and recover."
I assume this means they are using a digital twin simulation inside the HPC?
I always wondered if the "radiation hardening" approaches of the challenges like this https://codegolf.stackexchange.com/questions/57257/radiation... (see the tag for more https://codegolf.stackexchange.com/questions/tagged/radiatio...) would be of any practical use... I assume not, as the problem is on too many levels, but still, seems at least tangentially relevant!
I wonder how often problems happen that the redundancy solves. Is radiation actually flipping bits and at what frequency. Can a sun flare cause all the computers to go haywire.
Some people are claiming it's the good old RAD750 variant. Is there anything that talks about the actual computer architecture? The linked article is sadly devoid of technical details.
The part about triple-redundant voting systems genuinely blew my mind — it's such a different world from how most of us write software day to day, and honestly kind of humbling.
Headline needs its how-dectomy reverted to make sense
Does anyone know how this compares to Crew Dragon or HLS?
Multiple and dissimilar redundancy is nice and all, but is there a manual override? Apollo could be flown manually (and on Apollo 11 and 13 it had to be), but is this still possible and feasible? I'd guess so, as it's still crewed by (former) test pilots, much like Apollo.
> “Along with physically redundant wires, we have logically redundant network planes. We have redundant flight computers. All this is in place to cover for a hardware failure.”
It would be really cool to see a visualization of redundancy measures/utilization over the course of the trip to get a more tangible feel for its importance. I'm hoping a bunch of interesting data is made public after this mission!
NASA describes some impressive work for runtime integrity, but the lack of mention of build-time security is surprising.
I would expect to see multi-party-signed deterministic builds etc. Anyone have any insight here?
I wonder how they made the voted-answer-picker fail-resistant
How big of a challenge are hardware faults and radiation for orbital data centers? It seems like you’d eat a lot of capacity if you need 4x redundancy for everything
It would be nice to see some of the software source. I’m super interested and i think I helped pay for it
They should have also built a fault tolerant toilet.
The ARINC scheduler, RTOS, and redundancy have been used in safety-critical systems for decades; ARINC goes back to the 90's. Most safety-critical microkernels, like INTEGRITY-178B and LynxOS-178B, came with a layer for that.
Their redundancy architecture is interesting. I'd be curious of what innovations went into rad-hard fabrication, too. Sandia Secure Processor (aka Score) was a neat example of rad-hard, secure processors.
Their simulation systems might be helpful for others, too. We've seen more interest in that from FoundationDB to TigerBeetle.
2 outlooks.
2.
Two.
So honest and perhaps a bit stupid question.
Astronauts have actual phones with them - iPhones 17 I think? And a regular Thinkpad that they use to upload photos from the cameras. How does all of that equipment work fine with all the cosmic radiation floating about? With the iPhone's CPU in particular, shouldn't random bit flips be causing constant crashes due to errors? Or is it simply that these errors happen but nothing really detects them so the execution continues unhindered?
Typo in the first sentence of the first paragraph is oddly comforting since AI wouldn't make such a typo, heh.
Typo in the first sentence of the second paragraph is sad though. C'mon, proofread a little.
If I remember correctly, the space shuttle had four computers that all did the same processing, and a fifth that decided the correct answer if they didn't all match or some went down.
Can't find a Wikipedia article on it, but the Times had an article in 1981:
https://www.nytimes.com/1981/04/10/us/computers-to-have-the-...
apparently the 5th was standby, not the decider
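The shuttle's primary computers used the classic majority-vote model that the Orion article contrasts with fail-silent selection. A simplified sketch (the real system voted redundant outputs and isolated dissenters at the actuator level, so this is only the core idea):

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value a strict majority of computers agree on, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

# Three of four agree, so the dissenting computer is outvoted.
agreed = majority_vote([101.2, 101.2, 101.2, 99.8])
```

Note the contrast: here the system must decide which output is wrong, whereas in a fail-silent design each unit decides that about itself.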
The Artemis computer handles way more flight functions than Apollo did. What are the practical benefits of that?
This electrify-and-integrate playbook has brought benefits to many industries, usually where better coordination unlocks efficiencies. Sometimes the smarts just add new failure modes and predatory vendor relationships. It's showing up in space as more modular spacecraft, lower costs, and more mission flexibility. But how is this playing out in manned spacecraft?
They run 2 Outlook instances. For redundancy. /s
It's kinda crazy how this mission didn't hit mainstream media until recently.
The fail-silent design is the part worth paying attention to. The conventional approach to redundancy is to compare outputs and vote — three systems, majority wins. What NASA did here instead is make each unit responsible for detecting its own faults and shutting up if it can't guarantee correctness. Then the system-level logic just picks the first healthy source from a priority list.
That's a fundamentally different trust model. Voting systems assume every node will always produce output and the system needs to figure out which output is wrong. Fail-silent assumes nodes know when they're compromised and removes them from the decision set entirely. Way simpler consensus at the system level, but it pushes all the complexity into the self-checking pair.
The interesting question someone raised — what if both CPUs in a pair get the same wrong answer — is the right one. Lockstep on the same die makes correlated faults more likely than independent failures. The FIT numbers are presumably still low enough to be acceptable, but it's the kind of thing that doesn't matter until it does.
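The correlated-fault concern maps onto the beta-factor model from reliability engineering (the kind of analysis covered in the Goble book mentioned upthread): some fraction beta of faults are common-cause and hit both halves of a pair identically, defeating the self-check. All numbers below are made up purely for illustration:

```python
def undetected_fault_rate(fault_rate_per_cpu, beta):
    """Rate of common-cause faults that slip past a self-checking pair.

    fault_rate_per_cpu: faults per hour for one CPU (e.g. 100 FIT = 100e-9)
    beta: assumed fraction of faults that are common-cause (hit both CPUs)
    """
    return beta * fault_rate_per_cpu

# Illustrative only: 100 FIT per CPU with a 1% common-cause fraction
# leaves roughly 1 FIT of undetectable (matching-wrong-answer) faults.
rate = undetected_fault_rate(100e-9, 0.01)
```

Whether that residual rate is acceptable is exactly the kind of budget the certification analysis has to close.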
The quote from the CMU guy about modern Agile and DevOps approaches challenging architectural discipline is a nice way of saying most of us have completely forgotten how to build deterministic systems. Time-triggered Ethernet with strict frame scheduling feels like it's from a parallel universe compared to how we ship software now.
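For anyone who hasn't seen a time-triggered system: the core idea is that every task owns a fixed, pre-planned slot in a repeating major frame, so timing is deterministic by construction rather than emergent. A minimal sketch, with entirely invented slot assignments and frame length:

```python
MAJOR_FRAME_MS = 40
SCHEDULE = [            # (start_ms, task) -- static, decided offline
    (0,  "read_sensors"),
    (10, "navigation"),
    (20, "control_laws"),
    (30, "telemetry"),
]

def task_at(time_ms):
    """Return the task owning the slot that contains time_ms."""
    t = time_ms % MAJOR_FRAME_MS
    owner = None
    for start, name in SCHEDULE:
        if t >= start:
            owner = name
    return owner
```

There is no dynamic scheduling decision at runtime at all, which is precisely what makes worst-case timing (and fault containment on the network) analyzable.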