
I'm designing software that manages a configuration file at the application layer in embedded Linux.

Generally, it maintains two copies of the configuration file, one in RAM and one in flash memory. As soon as an end user updates a setting through the UI, the software saves the change to the file in RAM and then copies it to the file in flash memory.

This scheme gives the best durability, in that the flash copy reflects reality within a second. However, it compromises the longevity of the flash memory by writing to it on every change.

As for the longevity issue, I've thought about having a dedicated program do this housekeeping, adding it to crontab and letting it run every 30 minutes or so. (Note: flash memory wears only during erase cycles; the program does the housekeeping only if the two files differ.)
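The housekeeping pass could be sketched roughly like this (file paths and buffer sizes are made up for illustration). One detail worth adding to the plan: replacing the flash copy via a temp file, `fsync()` and `rename()` means an interruption mid-write never destroys the old flash copy, since `rename()` is atomic within a filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if the two files differ (or either cannot be opened), 0 if equal. */
static int files_differ(const char *a, const char *b)
{
    FILE *fa = fopen(a, "rb");
    FILE *fb = fopen(b, "rb");
    int differ = 1;
    if (fa && fb) {
        int ca, cb;
        do {
            ca = fgetc(fa);
            cb = fgetc(fb);
        } while (ca == cb && ca != EOF);
        differ = (ca != cb);
    }
    if (fa) fclose(fa);
    if (fb) fclose(fb);
    return differ;
}

/* Replace dst with the contents of src: write a temp file, fsync it,
 * then rename() it over dst. An interrupted run leaves the old dst
 * untouched. Returns 0 on success. */
static int sync_to_flash(const char *src, const char *dst)
{
    char tmp[256];
    snprintf(tmp, sizeof tmp, "%s.tmp", dst);

    FILE *in = fopen(src, "rb");
    FILE *out = fopen(tmp, "wb");
    if (!in || !out) {
        if (in) fclose(in);
        if (out) fclose(out);
        return -1;
    }

    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fwrite(buf, 1, n, out);

    fflush(out);
    fsync(fileno(out));      /* push the data to the medium */
    fclose(out);
    fclose(in);

    return rename(tmp, dst); /* atomic on the same filesystem */
}

/* Housekeeping pass: only touch flash when the copies differ. */
static int housekeeping(const char *ram_copy, const char *flash_copy)
{
    if (!files_differ(ram_copy, flash_copy))
        return 0; /* nothing to do, no flash wear */
    return sync_to_flash(ram_copy, flash_copy);
}
```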

But if the file in RAM is still waiting for the housekeeping run and the system shuts down unexpectedly, the changes will be lost.

So I'm wondering: is there a way to get both longevity and no data loss at the same time? Or am I missing something?

  • The normal way such things are implemented is by keeping a "mirror segment" of the same data in flash, so that it must always have two identical copies stored. In case there's power loss during programming, the program can revert to the non-corrupt version and also restore the corrupt one. Ideally you would have ECC in the MCU, but at least some manner of CRC. – Lundin Jan 17 '22 at 09:58
  • @Lundin I'm not quite understanding. This way doesn't solve the longevity issue but worsens it. In your case there are two copies in flash waiting for an update every time, instead of one. – Mr.nerd3345678 Jan 21 '22 at 05:59
  • You need to design with margins for the purpose of data retention anyway. Mirror segments is a solution to flash corruption during power loss. Let me flesh this out in an answer. – Lundin Jan 21 '22 at 07:20

1 Answer


There are many different reasons why flash can get corrupted: data retention loss over time, erase/write failures primarily caused by erase/write cycle wear, clock inaccuracies, read disturb in the case of NAND flash, and even less likely error sources such as cosmic rays or EMI. But also, as in your case, algorithmic-layer problems such as a flash erase/write getting interrupted by power loss or a reset caused by EMI.

Similarly, there are many ways to deal with these various problems.

  • CRC16 or CRC32 depending on flash size is the classic way to deal with pretty much all possible flash errors, particularly with data retention since it most often manifests itself as single-bit errors, which CRC is great at discovering. It should ideally be designed so that the checksum is placed at the end of each erase-size segment. Or in case erase-size is very small (emulated eeprom/data flash etc), maybe a single CRC32 at the end of all data. Modern MCUs often have a CRC hardware peripheral which might be helpful.

    Optionally you can let the CRC algorithm repair single bit errors, though this practice is often banned in high integrity systems.

  • ECC is used on NAND flash or in high integrity systems. Traditionally done through software (which is quite cumbersome), but lately also available through built-in hardware support, particularly on the "safety/chassis" kind of automotive microcontrollers. If you wish to use ECC then I highly recommend picking a part with such built-in support, then it can be used to replace manual CRC (which is somewhat painful to deal with real-time wise).

    These parts with hardware ECC may also support a feature with an area where you can write down variables to have the hardware handle writing them to flash in the background, kind of similar to DMA.

  • Using the flash segment as FIFO. When storing reasonably small amounts of data in memory with large erase sizes, you can save flash erase/write cycles by only erasing the whole segment once, after which it will likely be set to "all ones" 0xFFFF... When writing, you look for the last available chunk of memory which is "all ones" and write there, even though the same data was previously written just before it. And when reading, you fetch the last written chunk before "all ones". Only when the whole erase size is used up do you perform an erase and start over from the beginning - data needs to be stored in RAM during this.

    I strongly recommend picking a part with decent data flash though, meaning small erase sizes - so that you don't need to resort to hacks like this.

  • Mirror segments where all memory is stored as duplicates in two separate segments is mandatory practice for high integrity systems, though this can also be used to prevent corruption during power loss/unexpected resets and of course flash corruption in general. The idea is to always have at least one segment of intact data at all times, and optionally repair a corrupt one by overwriting it with the correct one at start-up. Also meaning that one segment must be verified to be correct and complete before writing to the next.

  • Keep the product cool. This is a hardware solution obviously, but data retention in particular is heavily affected by ambient temperature. The manufacturer normally guarantees some 15-20 years or so up to 85°C, but that might mean 100 years if you keep it at <25°C. As in, whenever possible, avoid mounting your MCU PCB near exhausts, oil coolers, hydraulics, heating elements etc etc.
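The CRC check described above can be sketched as follows. This uses the common reflected CRC32 (polynomial 0xEDB88320, as in zlib/Ethernet); the segment layout, with the checksum in the last 4 bytes, is illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32, reflected polynomial 0xEDB88320 (zlib/Ethernet variant). */
uint32_t crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return ~crc;
}

/* Segment layout (illustrative): [payload ...][crc32, little-endian]. */
int segment_is_valid(const uint8_t *seg, size_t seg_len)
{
    if (seg_len < 4)
        return 0;
    uint32_t stored = (uint32_t)seg[seg_len - 4]
                    | (uint32_t)seg[seg_len - 3] << 8
                    | (uint32_t)seg[seg_len - 2] << 16
                    | (uint32_t)seg[seg_len - 1] << 24;
    return crc32(seg, seg_len - 4) == stored;
}
```

On MCUs with a CRC peripheral, `crc32()` would be replaced by the hardware unit; the layout check stays the same.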
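The segment-as-FIFO idea can be sketched like this, modelling the segment as a RAM buffer (a real driver would program actual flash). Record and segment sizes are made up; note that a record consisting entirely of 0xFF bytes would be indistinguishable from a free slot, so real data would need a marker byte or checksum per record:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define REC_SIZE  16                      /* illustrative sizes */
#define SEG_SIZE  4096
#define NUM_SLOTS (SEG_SIZE / REC_SIZE)

/* An erased slot still reads as all ones (0xFF). */
static int slot_is_free(const uint8_t *seg, int slot)
{
    for (int i = 0; i < REC_SIZE; i++)
        if (seg[slot * REC_SIZE + i] != 0xFF)
            return 0;
    return 1;
}

/* Append a record into the next free slot without erasing.
 * Returns the slot index written, or -1 if the segment is full
 * (only then does the caller erase and start over). */
int fifo_append(uint8_t *seg, const uint8_t rec[REC_SIZE])
{
    for (int s = 0; s < NUM_SLOTS; s++) {
        if (slot_is_free(seg, s)) {
            memcpy(&seg[s * REC_SIZE], rec, REC_SIZE); /* flash program */
            return s;
        }
    }
    return -1; /* segment full: erase needed */
}

/* Returns a pointer to the newest record, or NULL if none written yet. */
const uint8_t *fifo_latest(const uint8_t *seg)
{
    const uint8_t *latest = NULL;
    for (int s = 0; s < NUM_SLOTS; s++) {
        if (slot_is_free(seg, s))
            break;
        latest = &seg[s * REC_SIZE];
    }
    return latest;
}
```

With these sizes, one erase cycle is spent per 256 writes instead of per write.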

Mirror segments in combination with CRC and/or ECC are likely the solution you are looking for here. Again, I strongly recommend picking an MCU with dedicated data flash, meaning small erase segments and often far more erase/write cycles, ideally >100k.
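The mirror-segment update and recovery sequence could look roughly like this. Segments are modelled as RAM buffers with illustrative sizes, and the checksum is a simple stand-in for a real CRC32; the point is the ordering: write and verify the mirror before touching the primary, so one intact copy exists at every instant:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SEG_PAYLOAD 60               /* illustrative sizes */
#define SEG_SIZE    (SEG_PAYLOAD + 4)

/* Stand-in for a real CRC32; any strong checksum works here. */
static uint32_t checksum(const uint8_t *p, size_t n)
{
    uint32_t sum = 0xDEADBEEFu;
    for (size_t i = 0; i < n; i++)
        sum = sum * 31u + p[i];
    return sum;
}

static void seal(uint8_t seg[SEG_SIZE])
{
    uint32_t c = checksum(seg, SEG_PAYLOAD);
    memcpy(&seg[SEG_PAYLOAD], &c, 4);
}

static int valid(const uint8_t seg[SEG_SIZE])
{
    uint32_t c;
    memcpy(&c, &seg[SEG_PAYLOAD], 4);
    return c == checksum(seg, SEG_PAYLOAD);
}

/* Update: program + verify the mirror (B) first, only then the primary (A). */
int config_update(uint8_t a[SEG_SIZE], uint8_t b[SEG_SIZE],
                  const uint8_t payload[SEG_PAYLOAD])
{
    memcpy(b, payload, SEG_PAYLOAD); /* erase + program segment B */
    seal(b);
    if (!valid(b))
        return -1;                   /* B failed: A is still intact */
    memcpy(a, payload, SEG_PAYLOAD); /* now erase + program segment A */
    seal(a);
    return valid(a) ? 0 : -1;
}

/* Boot: pick an intact copy and repair a corrupt one from it. */
const uint8_t *config_recover(uint8_t a[SEG_SIZE], uint8_t b[SEG_SIZE])
{
    if (valid(a)) {
        if (!valid(b))
            memcpy(b, a, SEG_SIZE);
        return a;
    }
    if (valid(b)) {
        memcpy(a, b, SEG_SIZE);
        return a;
    }
    return NULL; /* both corrupt: fall back to defaults */
}
```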

Lundin