eheap Part 3: Self Healing

Manuals

White Papers

	Home > Library > White Papers

eheap Part 3: Self Healing

by Ralph Moore, eheap Architect
August 2021

Introduction

eheap is a new embedded system heap that provides self-healing features to help exposed embedded systems better deal with environmental disturbances, malware, cyber attacks, and other problems. eheap will run on any RTOS-based or standalone system.

This is the third in a series of three papers on eheap:

Why Heaps?

A data structure may contain valuable information, but each data item in it can be subjected to reasonableness checks and discarded if the checks fail. Data being transmitted or received is generally protected by error codes. Heaps have a high density of pointers. For example, a heap with average chunk size of 32 bytes has a pointer density 25%. This makes heaps particularly vulnerable to bit flips and overwrites, yet they are unprotected. One bad pointer can put the system into the weeds. The pointer density of code may be comparable to heaps, but code is usually in flash, which is safe from overwrites and relatively immune to bit flips. Of course, there are other RAM areas with high pointer densities, but heaps seem to be a good place to start.

Self-Healing Assumptions

Implementing self-healing requires making reasonable assumptions. For eheap, it is assumed that only one word can be damaged at a time. It is further assumed that a heap has hundreds to hundreds of thousands of chunks and that millions to billions of normal heap operations will occur between events that damage the heap. Hence, the odds are good that there is sufficient time to fix the heap before a malloc() or free() steps on a rotten pointer. This extra bit of protection could be vital if large numbers of Things are running in exposed environments as is currently envisioned for the IoT. Self-healing is implemented in eheap primarily by heap scanning and bin scanning, which are discussed next.

Heap Scanning

Because every chunk has both a forward link to the next chunk and a backward link to the previous chunk, eheap can be scanned forward and backward. This is necessary to find and fix broken links, as well as to fix other broken fields in CCBs. The eheap heap scan service is intended to be run during idle time so that it will not interfere with normal operation. (As discussed in prior parts of this series, real-time systems must have significant idle time in order to handle peak events and to be expandable for future requirements.)

eheap scans each chunk in the heap from the start to the end of the heap, fixing broken links, flags, sizes, and fences, as it goes. In the debug version, the scan stops on a broken fence so that the fence can be examined for clues concerning what happened. In the release version, the fence is fixed. Fixes are reported and saved in the event buffer for later system analysis. It is important to note that during scans, all pointers are heap-range tested before use in order to avoid data abort exceptions that might stop the system.

If a broken forward link is found, and the chunk is either a free chunk or a debug chunk, the chunk size field is used to attempt a fix. (Chunk size is effectively a redundant forward link.) Otherwise, a backward scan is started from the end of the heap, which proceeds until the break is found and fixed. While back scanning, other broken forward links are also fixed, if encountered. However a broken backward link stops the backward scan and a bridge is formed between the chunk with the broken forward link and the chunk with the broken backward link. When this happens, several chunks may be bridged over. Bridging serves to allow the scan to complete and may give the system a chance to limp along and to possibly fix itself, but generally it does not fix the problem and thus a HEAP_BRKN error is reported.

Need for Incremental Scanning

In a hard real-time system it is imperative that high-priority tasks not be blocked from running for so long that they miss their deadlines. However, the heap scan must not be preempted by other heap functions, or errors will occur. Hence heap scan must be a system service routine (SSR), which is non-preemptible.

To resolve this conflict, heap scan has been designed to operate in an incremental fashion. It is called as follows:

   smx_HeapScan(cp, fnum, bnum);

Where cp points at the chunk from which to start this run, fnum is the number of chunks to scan forward, and bnum is the number of chunks to scan backward. bnum is usually larger than fnum because backward scanning is faster and it is more urgent because a broken forward link has been found that needs to be fixed. For example, fnum might be 2 and bnum might be 50. Hence, higher priority tasks would usually be blocked only for the time needed to scan two chunks, unless a break was found.

Heap Scan Operation

Each call of HeapScan() is called a run; each time the full heap is scanned is called a scan. A scan may require thousands of runs, possibly more. HeapScan() is usually called as follows:

   smx_HeapScan(NULL, fnum, bnum);

cp == NULL means to use the eheap static forward or backward pointer to start the run. These pointers point to the first chunk to scan when a run begins. If the hs_fwd mode is ON, the direction of the scan is forward, if it is OFF, the direction of the scan is backward. Forward is the normal direction. If a free() merges the chunk pointed to by either pointer, the pointer is moved to point at the resultant merged chunk.

When a broken forward link is found, hs_fwd is turned OFF and the static backward scan pointer is set to point at the end of the heap. The next run starts a backward scan. When the backward scan has fixed or bridged the break, hs_fwd is turned back ON and forward scanning resumes on the next run from where it stopped.

When the end of the heap is reached, or an unfixable error occurs, HeapScan() returns TRUE. Otherwise it returns FALSE, which means to keep scanning. Hence calling it from idle can be done as follows:

   if (--heap_scan == 0)
   {
      smx_HeapScan(NULL, 2, 50);
      heap_scan = HEAP_SCAN_CNT;
   }

Then HeapScan() will be called once per HEAP_SCAN_CNT idle passes. This limits time spent scanning. An unfixable error would normally be handled by the error manager, but it might be useful to deal with it here. In either case, hs_fwd is turned ON and the static forward pointer is set to NULL. This causes the next run to start from the beginning of the heap.

Heap scan steadily patrols the heap looking for trouble and fixing what it can. For one run per tick, fnum = 2, and a 100,000 chunk heap, 50,000 ticks would be required to complete a full heap scan. That is 500 seconds at 100 ticks/sec, which should be adequate. As noted above, this provides a little extra protection against the many things that can to wrong in a heap over a long period of time.

Bin Scanning

Whereas bins are in local memory and may be less susceptible to bit flips, most bin free list pointers are in the free chunks out in the heap and thus are just as susceptible to damage as are heap links. Thus bin free list scanning is necessary to provide a consistent level of protection. Bin scan is similar to heap scan. It is incremental, scans doubly-linked chunk lists, and fixes broken links, if it can. It is called as follows:

   smx_HeapBinScan(binno, fnum, bnum);

It has three parameters: binno, fnum, and bnum. Like heap scan, it returns FALSE until it is done with a bin or an unfixable error is found. If a preempting malloc() uses a bin being scanned, the bin scan is restarted. (Bin free chunk lists are assumed to be much smaller than the heap.)

Like heap scan, bin scan scans backward to fix broken forward links, fixes are reported, and double breaks are bridged. In that case, the bridged chunks are no longer available to be allocated, but heap operation can continue normally, as long as merging is OFF. Thus, bridging is a partial solution, but HEAP_BRKN is still reported in order to allow a better solution to be implemented. If merging is ON, free may fail due to encountering a bad pointer when it attempts to merge a bridged chunk.

Bin scan is also intended to be called from idle, such as:

   void IdleMain(void)
   {
      static u32  i = 0;
      ...
      if (smx_HeapBinScan(i, 2, 10))
         i = (i == smx_top_bin ? 0 : i++);
      ...
   }

This results in all bins being scanned, 2 chunks forward, at a time. For simplicity, unfixable errors are assumed to be handled by the error manager.

Heap Recovery and Extend

Due to its use of selective deferred free chunk merging, eheap runs a higher risk of allocation failure than does a heap like dlmalloc, which enables merging all the time. As a consequence a heap recovery service has been implemented:

   smx_HeapRecover(sz, fnum);

which searches the heap incrementally for adjacent free chunks that can be merged to form a block >= sz. If found, this service merges the chunks and puts the merged chunk into a bin, so that malloc() can be recalled to get it.

eheap also provides a heap extension service

   smx_HeapExtend(xsz, xsp);

to extend the heap into adjacent or non-adjacent RAM in order to recover from an allocation failure. This could be useful if there is some unused RAM in a system due to, for example: a feature not being installed, provision of expansion RAM for the future, or availability of lower performance RAM.

Error Handling

As noted in Part 2, eheap does extensive testing of heap service parameters, links, flags, fences, etc. and reports errors found to the Error Manager (EM). In a thoroughly debugged, mature system, these kinds of errors are likely to be due to malware or cyber attacks, rather than bugs. Application code to recover from them can be put into smx_EMHook(). Alternatively, most error testing can be turned off in release versions by compiling with HEAP_SAFE OFF. This improves performance at the risk of reduced safety.

If all else fails, the HEAP_BRKN error can be handled by reinitializing the heap and restarting all tasks that use it. This can be a workable solution, if high-priority, mission-critical tasks use only block pools. In this case, valuable data may be lost, but the system will continue performing its vital functions. By taking action before malloc() or free() encounter a broken link, the system is saved from going out of control and failing completely.

Summary

eheap is a new embedded heap that has three types of Chunk Control Blocks (CCBs): free, inuse, and debug. It offers significant debug features, such as debug CCBs, chunk fill patterns, error checking, and trace functions, which are described in Part 3. These can be of great value in finding and fixing heap-usage bugs, which often are difficult to track down by other means.

back to White Papers page

Home Sitemap Contact