Safety-critical realtime with Linux

WalterBright · on Sept 30, 2017

Every industry seems determined to learn from scratch on their own how to make safety critical systems. None look at an industry that figured this out 50+ years ago - the airframe industry, which has an incredibly good track record of making safe systems out of unreliable parts.

I wrote a couple articles on the general idea:

https://digitalmars.com/articles/b39.html

https://digitalmars.com/articles/b40.html

fit2rule · on Sept 30, 2017

You're right, but its the Rail transportation people that figured it out before the Air guys ..

WalterBright · on Sept 30, 2017

You'll see bits and pieces of it in other places, for example, the dual independent braking system in all cars. But the auto engineers failed to generalize that into the car's computer controls.

The Fukushima and Deepwater Horizon disasters show how other industries fail at applying the concept. Reading the sequence of failures in those just makes me grind my teeth. For example, venting the hydrogen overpressure, good idea. Venting it into an enclosed space with sparking electrical equipment, spectacularly dumb. (And the list goes on and one with both disasters.)

Nobody has taken it to heart and applied it pervasively like the airframe industry.

kev009 · on Sept 30, 2017

I find that subject line completely terrifying. Please use a small trusted compute base, hopefully with rigorous auditing and attempts at formal modeling, for safety critical systems. The Linux kernel development process is not suitable for this domain.

mechatronix00 · on Sept 30, 2017

I’ve read SpaceX uses Linux to fly Falcon and Dragon. If true, I wonder how they got happy with it for safety critical tasks. Or maybe those safety critical tasks get offloaded to microcontrollers with easier to audit codebases.

edit: reference to the article- https://lwn.net/Articles/540368/

ficklepickle · on Sept 30, 2017

I would have thought they'd use Dragonfly BSD. Get it!?! Dragon FLY... 'Cuz it flies...

I'll show myself out.

reificator · on Sept 30, 2017

Agreed. I use a lot of Linux, but if things go wrong the losses are financial and temporary. I can't imagine trying to design a system that must never fail on top of Linux. Reliability is rarely served by addition.

srcmap · on Sept 30, 2017

Designed realtime with Linux is non-trivial.

I worked on HA (Highly Available) system with Linux inside Xilinx's Vertex Pro PPC. It is redundant system with multiple fault detections and switch over if any subsystem detected failure.

There was one 250 ms hard real time requirements: If I am a slave and don't detect the master 's UDP ping for 250 ms. I will assume the master has failed somehow, I will start action and take over control as master.

The sub-system did trigger from time to time while the master is alive and working perfectly OK.

Eventually I figured out that one of the system API was using > 250 ms time. (Forget which one now, that was > 10 years ago.) I have to profile very carefully and redesign the code to get around that API.

burntrelish1273 · on Sept 30, 2017

Sorta like a distributed watchdog or HA failover.

(Btw my favorite acronym is STONITH.)

occultist_throw · on Sept 30, 2017

Yeah, STONITH bites us on our ass every so often. Primarily, because of split-brain and no good way to ameliorate that without throwing out data.

Its better to have an odd quorum, to break tie-breakers.

aidenn0 · on Sept 29, 2017

Segmentation faults are a bad example of a fault you don't want in a safety-critical system. A crash is okay, because you will usually have some sort of fall-back (e.g. most power steering systems work unpowered). It's non-crashes that cause silent improper behavior that are bad.

Of course, a segmentation fault is usually a symptom of pointer misuse, which means your code is likely to also suffer from corruptions.

WalterBright · on Sept 30, 2017

> Segmentation faults are a bad example of a fault you don't want in a safety-critical system.

A safety critical system, absolutely must be able to handle seg faults without compromising safety. Any system that requires that a seg fault will not occur is not a safe system.

AnimalMuppet · on Sept 29, 2017

Not all systems work with the software down. Yes, power steering works with the power down, but that's for an engine failure, not for an electronics failure or software crash. But I believe that airplanes that use software to adjust the wing control surfaces move them to neutral positions on software failure. (I can't prove that, and I have no first-hand knowledge - just recall hearing it once, if I remember correctly). That means that, while the software's down, the wings won't break off, and it won't crash due to the control surfaces doing something bizarre, but you can't navigate the plane. That's still pretty bad.

srcmap · on Sept 30, 2017

This remind me of debug a flight control problem with my son's model airplane.

We did a foam board stretch build of Model Airplane couple years ago.

https://www.flitetest.com/articles/ft-guinea-build

After we flew a few times, we added a servo to the cargo bay to be open and drop small parachute from it.

After than, my son kept complained that he would loss control for 1-2 seconds while the plane is in the air from time to time.

After research on the net, eventually I figure out it was the addition servo. Each servo in that air plane use about 1 Amp of current depend on condition. The 5th servo actually drew enough current that causes the remote receiver to reboot while the plane is in flight.

That was a fun bug to figure out.

The RC receiver is actually very good design which can be reboot and be operational in ~1 seconds.

skykooler · on Sept 29, 2017

Commercial aircraft have mechanical fallbacks for all controls, and emergency protocols for if some controls don't work (for example, using spoilers to turn if ailerons are inoperative). Many military jets have pure software control, as they are aerodynamically unstable and are unable to be flown safely by hand if the fly-by-wire system goes down; however, the procedure for a system failure in a military jet is usually "eject!", which is not feasible for passenger aircraft.

WalterBright · on Sept 30, 2017

The Boeing 757 had 3 independent hydraulic systems, with 3 actuators per surface. Any two could overpower the 3rd. Any software in the loop was dual, written on different hardware using different algorithms, with a comparator that would physically disconnect both if they disagreed. The pilot could also physically disconnect them via circuit breakers.

There were also various mechanisms to prevent the surfaces from going too far (cannot move full travel at 500 mph, it would rip the airplane apart).

This is all very serious stuff, and was worked over by a lot of people imagining every perverse thing that could happen, and going through the list of things that had happened, to ensure it is safe.

The track record of the 757 in service shows how effective this is:

https://en.wikipedia.org/wiki/Category:Accidents_and_inciden...

None of them resulted in recommendations for design changes.

aidenn0 · on Sept 29, 2017

There are hard realtime systems for which segfaults are a game over, but even for the airplane, there will be a watchdog triggered and the system reboots. This means your control surfaces are neutral for about a second. There are some situations in which that could be fatal, but it typically is not.

If the system gets in a state where it is in a reboot loop, that is obviously bad, but those are more tractable to fix. In any event, I believe that FAA requires semi-formal analysis for software the failure of which would cause loss of life, but I might be wrong.

nickpsecurity · on Sept 29, 2017

Here's the regulations they usually go by:

https://en.wikipedia.org/wiki/DO-178B

https://en.wikipedia.org/wiki/DO-178C

There's an ecosystem of products developed this way, products such as static analyzers to help develop them, and even companies assisting in any part of the lifecycle. Market shifted toward reusable components for things such as RTOS's and middleware. Examples include INTEGRITY-178B, LynxOS-178B, Alt's OpenGL + driver for Radeon, partitioning Ethernet/networking/filesystems, Esterel SCADE, SPARK Ada, and so on. Lots of good stuff came from this.

On a tangent note, it's why I'm for software regulation where the useful subset of assurance methods that make those great become part of the requirements for commercial software. Defects will go down and predictability up across the board. Things will get more expensive with development slowing down a bit. That will probably be a good thing given Worse is Better effect market has been doing.

I went into more detail on regulation prior results and some predicted ones here:

https://www.schneier.com/blog/archives/2016/11/regulation_of...

zurn · on Sept 29, 2017

I wonder how they arrive at the X microsecond worst-case number for the software-based solutions. Does it take into account a perfect storm of APIC interprocessor events, interrupts, SMP cache coherency protocol worst-case behaviour and cross-CPU TLB shootdowns, misses on all levels of instruction/data TLBs and caches and DRAM, CPU trace cache behaviour, ECC machine check events, worst case OoO core behaviour wrt branch prediction and speculative execution, worst case interference from other SMT threads, other SoC functions accessing DRAM, etc?

It would seem to me that a worst case scenario could easily cause slowdowns of many orders of magnitude. You could mitigate some of them by careful manual memory layout and hardware specific tricks like hardwired TLB entries, but still be left with a lot of uncovered stuff.

ridiculous_fish · on Sept 30, 2017

Interrupt-based designs are bad for real time, for the reasons you give. Instead you want to use techniques like polling and static scheduling, where every process gets a fixed time slice. This reduces the variability and improves fidelity to your model, since you know at every point what processes are running.

occultist_throw · on Sept 30, 2017

Easy to say that interrupts are bad.. But when you need to know of a pin change, rising or falling IRQ, interrupts are the only game here. And not only that, but if we were to do the timeslice strategy, then we need that slice to be at least 2x more than the fastest input for that pin.

I would think with that strategy, you'd be servicing "slice-interrupts" more than anything else.

I'll stick with writing ISR's that are only a few commands and do my work outside the ISR's, as standard in industry now.

Now, on a related topic, if they're discussing getting the system to a 10us timing, that would be useful in using Linux as a 3d printer controller directly, rather than an Atmel or STM32 chip. But those requirements of what needs turned off seems pretty onerous, unfortunately.

Too · on Sept 30, 2017

Interrupt analysis is a core element for designing a real time system. High speed data interfaces don't always work good with polling. In theory it's actually quite easy, from the data sheets you can find out or configure the max frequency an interrupt can occur, then you just multiply that frequency with the WCET of your interrupt routine, assume it will always happen and deduct that time from your main loop time budget.

jacquesm · on Sept 30, 2017

The trick to this is to do the absolute minimum required in the interrupt routine and to push the rest of the work to a regular piece of software that gets scheduled at the right priority due to the interrupt having been processed.

Doing actual work in an interrupt routine is a non-starter in a real time environment.

zurn · on Sept 30, 2017

IO interrupts and scheduling based surprises were a pretty small part of the list, though.

cjbillington · on Sept 29, 2017

Looking forward to CPUs getting on-die FPGAs so we can actually chuck some fully deterministic timing tasks onto them. Even if you're running on metal without an OS, there are heaps of things that can stop your code from running with predictable timing, and it seems like it's getting worse as CPUs are getting more complex.

greenleafjacob · on Sept 30, 2017

Related: reducing system jitter [1]

[1] http://epickrram.blogspot.co.uk/2015/09/reducing-system-jitt...

agumonkey · on Sept 30, 2017

simple and super valuable article.

mkj · on Sept 30, 2017

The PRU in some TI chips like that in the Beaglebone give you that kind of thing. A couple of deterministic 200mhz arm cores that run separately from the main cpu. They've been used for lots of motor control projects and also bit-banging network protocols.

mkj · on Oct 2, 2017

Correction: they're not ARM, but a different RISC architecture.

speleo_engr · on Sept 29, 2017

We've had FPGA with CPU for 5 years or so on Xilinx Zynq. :)

kyzyl · on Sept 30, 2017

Amusingly, Xilinx makes dev tools that are so unbelievably bad that it's almost impossible to do any of the best practices for development that you would hope for in safety critical systems. No source control, code review, reproducible builds, integration testing, centralized code review, etc.

Presumably you can sort of brute force it if you have defense level budgets, but it's a seriously bad situation.

speleo_engr · on Sept 30, 2017

Yeah last time I used Vivado a couple years ago, version control on the project was a nightmare. Reproducible builds seem to be a problem with stock FPGA synthesis tools in general.

jeff_marshall · on Sept 30, 2017

Xilinx has being doing CPU+FPGA for longer than that. I did some C runtime enhancements for the PPC 440 included in the FX70 for a product made by a former employer (shadow stacks for better CFI, more locked down MMU use, etc).

arca_vorago · on Sept 30, 2017

I thought that the kernel had improved enough in recent years for sil3... perhaps not though.

I wasn't as aware of the issue of safety-critical systems as I should have been until I was inside a couple industrial companies where PLC's were everywhere (for this very reason). The thing that interests me now about this is how hard I see netconnected PLC's pushing into industrial applications, mostly because everyone in industry is on the edge of their seat for IOT to hit so they can use and abuse the data (instead of waiting for service call to pull data like they used to, why not just use an LTE-modem PLC, for example?) Do you see where I am going with this? Safety-critical industrial applications <sil4 are increasingly more vulnerable, and it's not from lack of realtime response to stimuli. In the end, using linux in realtime just seems to exacerbate this particular angle on the issue that I see. It does make me wonder about the implications of microkernel design vs monolithic in such applications though.

http://www.nfpa.org/codes-and-standards/all-codes-and-standa...

https://en.wikipedia.org/wiki/IEC_61508

https://webstore.ansi.org/RecordDetail.aspx?sku=ANSI%2fRIA+R...

https://www.iso.org/standard/69883.html

https://webstore.iec.ch/publication/22797

https://en.wikipedia.org/wiki/Comparison_of_real-time_operat...

irundebian · on Sept 30, 2017

Can somebody here recommend any books on developing safety-critical systems? I've read some part of Kleidermacher's "Embedded Systems Security" book and found it very helpful.

rocqua · on Sept 30, 2017

The title says [LWN subscriber-only content], the link seems to suggest the same.

It feels like LWM has bad access control, and someone abused that to post an article that shouldn't be free.

cesarb · on Sept 30, 2017

LWN has a cool feature where any subscriber can make these "subscriber links" to share an article with a friend/coworker, a relevant mailing list, or sometimes even HN. See https://lwn.net/op/FAQ.lwn#slinks for details.

In this case, the link was posted by Jon Corbet, who is LWN's main editor and developer.

kragniz · on Sept 30, 2017

The submitter is the LWN editor and the author of the post.

oliwarner · on Sept 30, 2017

They allow subscribers to share the links. But they do suggest you do this on limited scale and conscientiously.