We are compiling an embedded C/C++ application that is deployed in a shielded device in an environment bombarded with ionizing radiation. We are using GCC and cross-compiling for ARM. When deployed, our application generates some erroneous data and crashes more often than we would like. The hardware is designed for this environment, and our application has run on this platform for several years.

Are there changes we can make to our code, or compile-time improvements that can be made to identify/correct soft errors and memory corruption caused by single event upsets? Have any other developers had success in reducing the harmful effects of soft errors on a long-running application?


Are the values in memory changing, or are values in the processor changing? If the hardware is designed for the environment, the software should run as if it were in a non-radioactive environment.

If possible, you should set up a logging system that stores events in non-volatile memory that is resistant to radiation. Store enough information so that you can trace the event and easily find the root cause.
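
As a sketch of what such a record might contain (none of this is from the thread; nvram_write(), crc32() and the field layout are assumptions standing in for whatever the platform provides):

```c
#include <stddef.h>
#include <stdint.h>

/* One possible layout for a fault record: enough context to reconstruct
 * what happened, plus a CRC so a corrupted entry is at least detectable.
 * nvram_write() and crc32() stand in for whatever the platform provides
 * (FRAM/EEPROM driver, hardware CRC unit, ...). */
typedef struct {
    uint32_t sequence;    /* monotonically increasing entry number   */
    uint32_t uptime_ms;   /* time since boot when the event occurred */
    uint32_t event_code;  /* what was detected                       */
    uint32_t address;     /* faulting address / corrupted object     */
    uint32_t observed;    /* value actually read                     */
    uint32_t expected;    /* value that should have been there       */
    uint32_t crc;         /* CRC over all of the fields above        */
} fault_record_t;

extern int nvram_write(uint32_t offset, const void *data, uint32_t len);
extern uint32_t crc32(const void *data, uint32_t len);

int log_fault(fault_record_t *rec, uint32_t slot)
{
    rec->crc = crc32(rec, offsetof(fault_record_t, crc));
    return nvram_write(slot * (uint32_t)sizeof *rec, rec, sizeof *rec);
}
```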

Thomas Matthews: All memory has a FIT error rate, and hardware manufacturers make lots of promises. Most of the issues are likely caused by SEUs modifying RAM at runtime.

This is a combination hardware/software solution, but I know Texas Instruments (and probably others) makes embedded chips for safety critical applications that consist of two duplicate cores, running in lockstep, half a clock cycle out of phase. There are special interrupts and reset actions that get taken when the hardware detects something different between the cores, so you can recover from errors. I believe TI brands them as "Hercules" safety processors.
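
The lockstep comparison in these parts is done entirely in hardware, cycle by cycle, so there is nothing to write in C for it; purely as an illustration of the same idea at the software level, a critical computation can be run twice and the results compared. enter_safe_state() is a placeholder for the recovery action:

```c
#include <stdint.h>

/* Crude software analogue of lockstep execution: run a critical
 * computation twice and treat any disagreement as a detected upset. */
extern void enter_safe_state(void);

uint32_t checked_compute(uint32_t (*fn)(uint32_t), uint32_t arg)
{
    volatile uint32_t a = fn(arg);   /* volatile: force two separately   */
    volatile uint32_t b = fn(arg);   /* stored copies and a real compare */
    if (a != b)
        enter_safe_state();          /* mismatch: trust neither value */
    return a;
}
```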

Redundant rugged motors, some gears, shafts and ratchets! Replace annually or more often as dose rates require. No really, my first question with these kinds of issues has always been, do you really need that much software in there? Be as analog as you can possibly get away with.

This actually sounds like something that a pure language would be good at. Since values never change, if they are damaged you can just go back to the original definition (which is what it is supposed to be), and you won't accidentally do the same thing twice (because of lack of side effects).

RAII is a bad idea, because you can't depend on it performing correctly or even at all. It could randomly damage your data etc. You really want as much immutability as you can get, and error correction mechanisms on top of that. It's much easier to just throw away broken things than it is to try and repair them somehow (how exactly do you know enough to go back to the correct old state?). You probably want to use a rather stupid language for this, though - optimizations might hurt more than they help.
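
A minimal sketch of the "throw away and rebuild" idea in C, assuming the authoritative copy can live in const (flash/ROM) storage; the structure, values and seed are made up for illustration:

```c
#include <stdint.h>
#include <string.h>

/* "Throw away and rebuild" rather than repair: the reference copy lives
 * in const (flash/ROM) storage; the working RAM copy is discarded and
 * re-initialized whenever its checksum no longer matches. */
typedef struct {
    uint32_t gain;
    uint32_t limit;
    uint32_t checksum;   /* gain + limit + a fixed seed */
} config_t;

#define CHECKSUM_SEED 0xC0FFEE42u

static const config_t reference_config = {
    42u, 1000u, 42u + 1000u + CHECKSUM_SEED
};
static config_t working_config;   /* all-zero at boot: fails the check */

static uint32_t config_sum(const config_t *c)
{
    return c->gain + c->limit + CHECKSUM_SEED;
}

void config_validate(void)
{
    if (config_sum(&working_config) != working_config.checksum)
        memcpy(&working_config, &reference_config, sizeof working_config);
}
```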

PyRulez: Pure languages are an abstraction, hardware isn't pure. Compilers are quite good at hiding the difference. If your program has a value it logically shouldn't use anymore after step X, the compiler may overwrite it with a value that's calculated in step X+1. But this means you can't go back. More formally, the possible states of a program in a pure language form an acyclic graph, which means two states are equivalent and can be merged when the states reachable from both are equivalent. This merger destroys the difference in paths leading to those states.

Vorac - According to the presentation the concern with C++ templates is code bloat.

DeerSpotter: The exact problem is much bigger than that. Ionization can damage bits of your running watcher program. Then you will need a watcher of a watcher, then a watcher of a watcher of a watcher, and so on ...

Nowadays there is ECC available in hardware, which saves processing time. Step one would be to pick a microcontroller with built-in ECC.
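
Built-in ECC is configured at the hardware level and is vendor-specific, but the principle is easy to show in software. Below is a minimal Hamming(7,4) sketch, correcting any single flipped bit in a 4-bit value; ECC RAM applies the same single-error-correcting idea (with wider SEC-DED codes) transparently on every access:

```c
#include <stdint.h>

/* Hamming(7,4): encode a 4-bit nibble into a 7-bit codeword that can
 * correct any single flipped bit. Illustration only. */
static uint8_t hamming74_encode(uint8_t nibble)
{
    uint8_t d1 = (nibble >> 0) & 1;
    uint8_t d2 = (nibble >> 1) & 1;
    uint8_t d3 = (nibble >> 2) & 1;
    uint8_t d4 = (nibble >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;
    uint8_t p2 = d1 ^ d3 ^ d4;
    uint8_t p3 = d2 ^ d3 ^ d4;
    /* bit positions 1..7: p1 p2 d1 p3 d2 d3 d4 */
    return (uint8_t)(p1 << 0 | p2 << 1 | d1 << 2 | p3 << 3 |
                     d2 << 4 | d3 << 5 | d4 << 6);
}

static uint8_t hamming74_decode(uint8_t cw)
{
    /* Recompute each parity bit; the syndrome is the (1-based) index
     * of the flipped bit, or 0 if the codeword is consistent. */
    uint8_t s1 = ((cw >> 0) ^ (cw >> 2) ^ (cw >> 4) ^ (cw >> 6)) & 1;
    uint8_t s2 = ((cw >> 1) ^ (cw >> 2) ^ (cw >> 5) ^ (cw >> 6)) & 1;
    uint8_t s3 = ((cw >> 3) ^ (cw >> 4) ^ (cw >> 5) ^ (cw >> 6)) & 1;
    uint8_t syndrome = (uint8_t)(s1 | s2 << 1 | s3 << 2);
    if (syndrome)
        cw ^= (uint8_t)(1u << (syndrome - 1));   /* correct the single bit */
    return (uint8_t)(((cw >> 2) & 1) | ((cw >> 4) & 1) << 1 |
                     ((cw >> 5) & 1) << 2 | ((cw >> 6) & 1) << 3);
}
```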

Somewhere in the back of my mind is a reference to avionics (perhaps space shuttle?) flight hardware where the redundant architecture was explicitly designed not to be identical (and by different teams). Doing so mitigates the possibility of a systemic error in the hardware/software design, reducing the possibility of all of the voting systems crashing at the same time when confronted with the same inputs.

PeterM: AFAIK that's also claimed for the flight software for the Boeing 777: Three versions by three teams in three programming languages.

DanEsparza RAM typically has either a capacitor (DRAM) or a few transistors in feedback (SRAM) storing data. A radiation event can spuriously charge/discharge the capacitor, or change the signal in the feedback loop. ROM does not typically need the ability to be written (at least without special circumstances and/or higher voltages) and hence may be inherently more stable at the physical level.

DanEsparza: There are multiple types of ROM memories. If the "ROM" is emulated by, e.g., EEPROM or flash that is read-only at 5 V but programmable at 10 V, then that "ROM" is indeed still prone to ionization - maybe just less so than others. However, there are good ol' hardcore things like mask ROM or fuse-based PROM, which I think would need a really serious amount of radiation to start failing. I don't know, however, if they are still manufactured.

I really like your response. This is a more generic software approach to data integrity, and an algorithm-based fault tolerance solution will be used in our final product. Thanks!

One way of dealing with booleans being corrupted (as in your example link) could be to make TRUE equal to 0xffffffff then use POPCNT with a threshold.
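
A minimal sketch of that encoding, using the GCC/Clang __builtin_popcount() intrinsic; the type and threshold are illustrative:

```c
#include <stdint.h>

/* All-ones "true" and all-zeros "false", decided by a majority vote over
 * the 32 bits, so a few flipped bits cannot change the answer.
 * __builtin_popcount() is the GCC/Clang intrinsic. */
typedef uint32_t robust_bool;

#define ROBUST_TRUE  0xFFFFFFFFu
#define ROBUST_FALSE 0x00000000u

static inline int robust_is_true(robust_bool b)
{
    return __builtin_popcount(b) > 16;   /* strictly more than half the bits */
}
```

A stricter variant could treat counts close to the threshold as a detected error rather than silently rounding, and the alternating-pattern suggestion a couple of comments below works the same way after an XOR with the pattern.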

wizzwizz4 Given that 0xff is the default value of a non-programmed flash cell, that sounds like a bad idea.

%01010101010101010101010101010101, XOR then POPCNT?

wizzwizz4 Or just the value 0x1, as required by the C standard.

wizzwizz4 That's why you use some or all of the above-mentioned methods (ECC, CRC, etc.). Otherwise a cosmic ray may just as well flip a single bit in your .text section, changing an opcode or similar.
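
A sketch of one such check over the code image: CRC the .text section periodically and compare against a reference computed at build time (or at first boot) and kept somewhere protected. The __text_start/__text_end symbols are assumptions that depend on the linker script:

```c
#include <stdint.h>

/* Start/end of the code image; the exact symbol names depend on the
 * linker script - these two are assumptions. */
extern const uint8_t __text_start[];
extern const uint8_t __text_end[];

/* Plain bitwise CRC-32 (reflected, polynomial 0xEDB88320). Slow but small. */
static uint32_t crc32_bytes(const uint8_t *p, uint32_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}

/* Call periodically; compare the result against the stored reference CRC. */
uint32_t text_section_crc(void)
{
    return crc32_bytes(__text_start, (uint32_t)(__text_end - __text_start));
}
```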

Realistically speaking, how many modern compilers are there that don't offer -O0 or an equivalent switch? GCC will do a lot of strange things if you give it permission, but if you ask it not to do them, it's generally able to be fairly literal too.

Sorry, but this idea is fundamentally dangerous. Disabling optimizations produces a slower program. Or, in other words, you need a faster CPU. As it happens, faster CPUs are faster because the charges on their transistor gates are smaller. This makes them far more susceptible to radiation. The better strategy is to use a slow, big chip where a single photon is far less likely to knock over a bit, and gain back the speed with -O2.

A secondary reason why -O0 is a bad idea is because it emits far more useless instructions. Example: a non-inlined call contains instructions to save registers, make the call, restore registers. All of these can fail. An instruction that's not there cannot fail.

Yet another reason why -O0 is a bad idea: it tends to store variables in memory instead of in a register. Now it's not certain that memory is more susceptible to SEUs, but data in flight is more susceptible than data at rest. Useless data movement should be avoided, and -O2 helps there.

MSalters: What's important is not that data be immune to disruption, but rather that the system be able to handle disruptions in a manner meeting requirements. On many compilers disabling all optimizations yields code that performs an excessive number of register-to-register moves, which is bad, but storing variables in memory is safer from a recovery standpoint than keeping them in registers. If one has two variables in memory which are supposed to obey some condition (e.g. v1=v2+0xCAFEBABE and all updates to the two variables are done...
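
The comment is cut off, but the redundant-pair idea it names can be sketched roughly like this; fail_safe() is a placeholder for whatever recovery the system requires:

```c
#include <stdint.h>

/* Keep the value twice, offset by a constant, and check the invariant
 * v1 == v2 + OFFSET on every read. */
#define OFFSET 0xCAFEBABEu

extern void fail_safe(void);

typedef struct {
    volatile uint32_t v1;
    volatile uint32_t v2;   /* invariant: v1 == v2 + OFFSET */
} guarded_u32;

static void guarded_write(guarded_u32 *g, uint32_t value)
{
    g->v2 = value;
    g->v1 = value + OFFSET;
}

static uint32_t guarded_read(guarded_u32 *g)
{
    uint32_t a = g->v1;
    uint32_t b = g->v2;
    if (a != b + OFFSET)
        fail_safe();        /* one copy was disturbed; trust neither */
    return b;
}
```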

Binary counter watchdogs built from standard TTL ICs are indeed a 1980s solution. Don't do that. Today, there doesn't exist a single MCU on the market without built-in watchdog circuitry. All you need to check is whether the built-in watchdog has an individual clock source (good, most likely the case) or whether it inherits its clock from the system clock (bad).
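
The software side of such a built-in watchdog usually looks something like the sketch below; wdt_refresh() stands in for the MCU-specific register write, and the task bookkeeping is illustrative:

```c
#include <stdint.h>

/* Refresh the watchdog only when every task has checked in since the
 * last refresh, so a single stuck task also triggers the reset. */
#define TASK_COUNT 3u
#define ALL_ALIVE  ((1u << TASK_COUNT) - 1u)

extern void wdt_refresh(void);          /* MCU-specific register write */

static volatile uint32_t alive_flags;

void task_checkin(unsigned task_id)     /* called from each task; make this  */
{                                       /* atomic if tasks can preempt each  */
    alive_flags |= 1u << task_id;       /* other                             */
}

void watchdog_service(void)             /* called from the main loop */
{
    if (alive_flags == ALL_ALIVE) {
        alive_flags = 0u;
        wdt_refresh();                  /* otherwise let the dog bite */
    }
}
```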

Or implement the watchdog in an FPGA: ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20130013486.pdf

Still used extensively in embedded processors, incidentally.

Peter Mortensen Kindly stop your edit spree on every answer to this question. This is not Wikipedia, and those links are not helpful (and I'm sure everyone knows how to find Wikipedia anyhow...). Many of your edits are incorrect because you don't know the topic. I'm rolling back your incorrect edits as I come across them. You are not making this thread better, but worse. Stop editing.

Jack Ganssle has a good article on watchdogs: ganssle.com/watchdogs.htm

Ethernet is probably not a great idea to use in mission-critical applications. Neither is I2C, outside the PCB itself. Something rugged like CAN would be far more suitable.

Lundin Fair point, though anything optically connected (incl. ethernet) should be OK.

The physical medium is not so much the reason why Ethernet is unsuitable; it's the lack of deterministic real-time behavior. Though I suppose there are nowadays ways to provide somewhat reliable Ethernet too, I just group it together with commercial/toy electronics out of old habit.

Lundin that is a fair point, but as I'm suggesting using it to run RAFT, there will be (theoretically) non-deterministic real-time behaviour in the algorithm anyway (e.g. simultaneous leader elections resulting in a rerun election, similar to CSMA/CD). If strict real-time behaviour is needed, arguably my answer has more problems than Ethernet (and note that at the head of my reply I said 'correct' was often likely to be at the expense of 'fast'). I incorporated your point re CAN, though.

Lundin: No system which involves asynchronous aspects can be fully deterministic. I think the worst-case behavior of Ethernet can be bounded in the absence of hardware disruptions if software protocols are set up in suitable fashion, devices have unique IDs, and there is a known limit to the number of devices (the more devices, the larger the worst-case number of retries).

A few of these suggestions share a similar 'multi-bit sanity check' mindset for detecting corruption. I like this one the most, though, with its suggestion of safety-critical custom datatypes.
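
One way such a safety-critical datatype is often built is to give each legal value a bit pattern far from all the others, so a single upset can only produce an illegal pattern. A sketch, with illustrative values and a placeholder fail_safe():

```c
#include <stdint.h>

/* Legal states are 32-bit patterns chosen to be far apart in Hamming
 * distance, so one flipped bit can never turn a legal state into another
 * legal state - it produces an illegal pattern the validator catches. */
typedef uint32_t safe_state_t;

#define STATE_IDLE    0xA5A5A5A5u
#define STATE_RUNNING 0x5A5A5A5Au
#define STATE_FAULT   0xC3C3C3C3u

extern void fail_safe(void);

safe_state_t validate_state(safe_state_t s)
{
    switch (s) {
    case STATE_IDLE:
    case STATE_RUNNING:
    case STATE_FAULT:
        return s;
    default:
        fail_safe();          /* not a legal bit pattern */
        return STATE_FAULT;
    }
}
```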

There are systems in the world where each redundant node was designed and developed by different teams, with an arbiter to make sure they didn't accidentally settle on the same solutions. That way you don't have them all going down for the same bug and similar transients don't manifest similar failure modes.
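
The arbiter can be sketched in software as a 2-out-of-3 voter; in a real diverse-redundancy design the three inputs would come from independently developed channels, and fail_safe() is a placeholder:

```c
#include <stdint.h>

/* Majority vote over three redundant channels. */
extern void fail_safe(void);

uint32_t vote_2oo3(uint32_t a, uint32_t b, uint32_t c)
{
    if (a == b || a == c)
        return a;        /* a agrees with at least one other channel */
    if (b == c)
        return b;        /* a was the odd one out                    */
    fail_safe();         /* no two channels agree                    */
    return 0u;
}
```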

I'm trying to fathom a scenario where you can have a master outside the radiation environment, able to communicate reliably with slaves inside the radiation environment, where you couldn't just put the slaves outside of the radiation environment.

fostandy: The slaves are either measuring or controlling using equipment that needs a controller. Say a geiger counter. The master does not need reliable communication due to slave redundancy.

Introducing a master will not automatically mean increased security. If slave x has gone crazy due to memory corruption, so that it is repeatedly telling itself "master is here, master is happy", then no amount of CRCs or barked orders by the master will save it. You would have to give the master the possibility to cut the power of that slave. And if you have a common-cause error, adding more slaves will not increase safety. Also keep in mind that the amount of software bugs and the amount of things that can break increase with complexity.

That being said, it would of course be nice to "outsource" as much of the program to somewhere less exposed, while keeping the electronics inside the radioactive environment as simple as possible, if you have that option.

Indeed, nowhere in the question's text does the author mention that the application was found to run just fine outside the radioactive environment.

Surely an embedded system would much prefer safety-critical catches in one instance of a robust application to just firing off several instances, upping the hardware requirements and, to some extent, hoping on blind luck that at least one instance makes it through okay? I get the idea and it's valid, but I lean more towards the suggestions that don't rely on brute force.

Or rather, read through the technical requirements and implement those that make sense. A large part of the SIL standards is nonsense; if you follow them dogmatically you will end up with unsafe and dangerous products. SIL certification today is mainly about producing a ton of documentation and then bribing a test house. The SIL level says nothing about the actual safety of the system. Instead, you'll want to focus on the actual technical safety measures. There are some very good ones in the SIL documents, and there are some complete nonsense ones.

Perhaps an optical medium such as a CD-ROM would meet this definition. It would have the added bonus of a large capacity.

Yes, it would be similar, but a CD-ROM would rely less on mechanics, whereas this would be a fully mechanical system.

I wonder if there is a reason why they don't use punch-card readers in space.

Soren: Speed and physical space can each be a reason.

I'm a big fan of assembly language (as you can see from my answers to other questions), but I don't think this is a good answer. It's fairly possible to know what to expect from the compiler for most C code (in terms of values living in registers vs. memory), and you can always check that it's what you expected. Hand-writing a large project in asm is just a ton of extra work, even if you have developers that are very comfortable writing ARM asm. Maybe if you want to do stuff like compute the same result 3 times, writing some functions in asm makes sense (a compiler would otherwise CSE the repeated work away).
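
If the only reason to drop to assembly is to stop the compiler from folding repeated computations together, a lighter-weight (GCC/Clang-specific) option is to launder the inputs through an empty asm statement; compute() and fail_safe() below are hypothetical:

```c
#include <stdint.h>

/* An empty asm with a "+r" operand emits no code but makes the value
 * opaque to the optimizer, so the three calls below cannot be folded
 * into one even if compute() is inlined and pure. */
static inline uint32_t opaque(uint32_t x)
{
    __asm__ volatile("" : "+r"(x));
    return x;
}

extern void fail_safe(void);
extern uint32_t compute(uint32_t input);   /* the critical calculation */

uint32_t compute_voted(uint32_t input)
{
    uint32_t r1 = compute(opaque(input));
    uint32_t r2 = compute(opaque(input));
    uint32_t r3 = compute(opaque(input));

    if (r1 == r2 || r1 == r3) return r1;   /* majority wins          */
    if (r2 == r3)             return r2;
    fail_safe();                           /* no two results agree   */
    return 0u;
}
```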

The higher risk that has to be balanced against it is that upgrading the compiler can leave you with unexpected changes.

Sorry, but the question is about a radioactive environment where hardware failures will occur. Your answer is about general software optimization and how to find bugs. But in this situation, the failures aren't produced by bugs.

Yes, you can also blame Earth's gravity, compiler optimizations, third-party libraries, the radioactive environment and so on. But are you sure it's not your own bugs? :-) Until proven otherwise, I don't believe so. I once ran a firmware update while testing power-off situations - my software survived all power-off situations only after I had fixed all of my own bugs (over 4000 power-offs during the night). And in some cases it was difficult to believe there was a bug at all, especially when we are talking about memory corruption.
