The Eight Kernels Error Metrics
1. Building LLVM 3.2 2. Building Fault Injector LLVM Pass
1. Matrix Multiplication 2. Metropolis-Hastings MCMC 3. FFT 4. Breadth First Search 5. MD5 6. Set Union 7. Quicksort 8. GREP
Input Generation Fault Injection References
NSF Logo

Soft-Error Tolerant Big Data Kernels

(Sponsored by NSF Grant CNS-1527318), (Collaborative Project: CNS-1527051)


Due to hardware processes shrinking to several nanometers and the scaling up of Big Data analytics infrastructure, the threat of soft error-induced bit flips pose threats on Big Data analytics applications that cannot be ignored. Handling these threads requires exploration on multiple fronts: understanding of algorithms will be conducive to efficient error correction, while understanding of scaling-up will be helpful for efficient checkpointing, which can help error recovery. This page provides the experiment tools that can be used for such exploration.

The Eight Kernels

We have chosen eight kernels from BigDataBench, performed fault injection experients and added resilience techniques to them. The experiment results have shown that these kenrels may be made fault tolerant with a combination of algorithm-level invariants and low-level invariants. The details of the fault resilience techniques are described in [1].

Matrix-Matrix Multiplication
Matrix-Matrix Multiplication
Metropolis-Hastings Sampling
Metropolis-Hastings MCMC Sampling
Fast Fourier Transform
Fast Fourier Transform
Breadth-First Search
Breadth First Search (BFS)
MD5 Hashing
MD5 Hash
Set Union
Set Union
Global Regular Expression Parser

Fault injection experiments are performed on the original kernels and fault-resilient kernels to introduce bit-flip errors during run-time. Multiple experiments are performed, with one bit-flip error per run. The results are aggregated to form a comparison and evaluated using kernel-specific error metrics to form a comparison between the original and fault-resilient kernels. Details of the experiments are discussed in [1].

Fault Injection Workflow
Fault Injection Workflow

Kernel-Specific Error Metrics

The Error Metrics are used to determine for each of the kernels whether an output is correct and if it's not, how incorrent the output is. For each of the metrics, lower is better (meaning a result is less incorrect.)

Kernels Metric
Matrix Multiplication and FFT RMS error between faulty and golden outputs
Metropolis-Hastings MCMC Absolute error between mean of generated samples
Set Union, Quicksort Number of misplaced elements
Grep Difference in number of occurrences of the searched pattern
MD5 Checksum Whether the output equals the golden output

File Download

The Big Data Kernels come with two versions: a version with fault inejction and a version without fault injection.

For the version with fault injection, the kernels need to be compiled and instrumented through LLVM in order to enable fault injection experiments:

For the version without fault injection, the kernels may be compiled natively, no instrumentation is required.

For both versions, the following input generator is provided for conveniently scaling up inputs in BFS, MD5 and Grep kernels:

The next sections contain directions to how to compile the fault injection framework, how to compile the kernels, and how to use the input generator.

If you use this set of Big Data kernels in your work, please kindly cite [1]. Thanks for your interest!

Building the Fault Injector

The ``Fault Injector'' includes both the Fault Injection binary transform LLVM pass (which is derived from the KULFI fault injector from University of Utah), LLVM 3.2, as well as a run-time library. The run-time library are located in the sources of the big data kernels and are compiled into the binary, which is handled by the Makefiles there, and do not need to be compiled with the fault injection pass.
This section discusses the compilation of LLVM 3.2 and the fault injection pass.

Building and Testing Run the Big Data Kernels

There are three variants for each of the kernels:

  1. Non-fault-injected, non-fault-tolerant
  2. Fault-injected, non-fault-tolerant
  3. Fault-injected, fault-tolerant

The building and compiling of each of the variants are largely the same, but the input formats/parameters and outputs may slightly differ.
Specifically, the non-fault-injected variant will not show fault-injection-specific outputs.
The table below shows the outputs from the two fault-injected variants as examples.
Choose one kernel from the list below to view the build procedure and the expected output for the test runs.

Input Generation

The following shows the steps to generate inputs that take about 1 minute to run on an i5-3210M processor.
Depending on the kernel, inputs may be generated with a script, or the Graph Generator / Text Generator of the Big Data Generator Suite in BigDataBench (we include the versioned that's been modified for compatibility in the Download section.)
The parameters may be changed for larger/smaller inputs.

Fault Injection

The fault injector is designed to inject a single bit flip error into one dynamic instruction (also referred to as a "fault site") in a program run.
To perform fault injection, set the environment variable ``NEXT_FAULT_COUNTDOWN'' and run the instrumented binary.

For example, this command sets a "countdown" of 2585333 to the dynamic instruction that is to be fault-injected:

NEXT_FAULT_COUNTDOWN=2585333 ./qsort.bin

When executed, you will see an output that consist of 3 parts:

$ NEXT_FAULT_COUNTDOWN=2585333 ./qsort.bin 
[Fault Injection Campaign details]
   Max interval: 10000000
Reading configuration from environment variables.
   Next fault CountDown = 2585333
   Should initialize randseed = 0
   Bit position for faults=-1
   Dump BB Trace=0

Part 1 of the output simply shows the fault injection campaign (only the countdown to the next fault is utilized; the fault injection run-time has some other features that are not used in this project.)

Succeffully injected 32-bit Int Data Error!!
Total # faults injected : 1
Bit position is: 7
Index of the fault site : 2856
Total # of fault sites enumerated: 2585334

Part 2 of the output is printed when a fault is actually injected into the program, which includes some details about the current fault-injected run:
- The total # faults line suggests 1 bit-flip error has been introduced so far, which is the error that's been just injected.
- The bit position line suggests the 8th bit (the position is zero-based) has been flipped, which is equivalent to XOR'ing the output of the fault injected instruction with (1<<7).
- The index of the fault site line indicates the ID of the static fault site, namely, the static LLVM instruction.
- The total # of fault sites enumerated line indicates that a total of 2585334 fault sites have been enumerated. (The number may be slightly different from the specified Countdown.)

/*********************Fault Injection Statistics****************************/
Total # fault sites enumerated : 4787069

The last part of the output contains the rest of the program output as well as the fault injection summary.
- "FAULTINESS: 85" is the output from the proram-specific error check. In Quicksort, this means 85 elements in the output are located at incorrect places.
- The last line prints out the number of dynamic fault sites that have been enumerated in this fault-injected run. Because a fault injection may cause the control flow of a program to change, this number may be different from that of a non-fault-injected run, or another fault-injected run.

The size of the entire fault injection space could be prohibitively large, so usually a subset of the space is sampled. Details of the process of sampling and making sure the sample size is large enough to be representative of the characteristics of the kernel in question are discussed in [1].


Please kindly cite the following paper as your reference if you are using the kernels for your work.

[1] LeCompte, T., Legrand, W., Chen, S., Peng, L. "Soft error resilience of Big Data kernels through algorithmic approaches", J Supercomput (2017). doi:10.1007/s11227-017-2042-6

Page revision: 2017-06-05