Soft-Error Tolerant Big Data Kernels

(Sponsored by NSF Grant CNS-1527318), (Collaborative Project: CNS-1527051)

Introduction

As hardware process nodes shrink to several nanometers and Big Data analytics infrastructure scales up, soft error-induced bit flips pose a threat to Big Data analytics applications that cannot be ignored. Handling this threat requires exploration on multiple fronts: an understanding of the algorithms is conducive to efficient error correction, while an understanding of scaling behavior helps with efficient checkpointing, which in turn aids error recovery. This page provides the experiment tools that can be used for such exploration.

The Eight Kernels

We have chosen eight kernels from BigDataBench, performed fault injection experiments on them, and added resilience techniques to them. The experimental results show that these kernels can be made fault tolerant with a combination of algorithm-level invariants and low-level invariants. The fault resilience techniques are described in detail in [1].

Matrix-Matrix Multiplication
Metropolis-Hastings MCMC Sampling
Fast Fourier Transform
Breadth First Search (BFS)
MD5 Hash
Set Union
Quicksort
GREP

Fault injection experiments are performed on the original kernels and the fault-resilient kernels to introduce bit-flip errors at run time. Multiple experiments are performed, with one bit-flip error per run. The results are aggregated and evaluated using kernel-specific error metrics to compare the original and fault-resilient kernels. Details of the experiments are discussed in [1].

Fault Injection Workflow

Kernel-Specific Error Metrics

The error metrics are used to determine, for each kernel, whether an output is correct and, if it is not, how incorrect it is. For each metric, lower is better (meaning the result is less incorrect).

Kernels	Metric
Matrix Multiplication and FFT	RMS error between faulty and golden outputs
Metropolis-Hastings MCMC	Absolute error between the means of the generated samples (faulty vs. golden)
Set Union, Quicksort	Number of misplaced elements
Grep	Difference in number of occurrences of the searched pattern
MD5 Checksum	Whether the output equals the golden output
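The metrics in the table can be pictured with a minimal Python sketch. It is not the code shipped with the kernels: the function names are hypothetical, and counting "misplaced elements" as positions that differ from the golden output is an assumption made for illustration.

```python
import math

def rms_error(faulty, golden):
    """RMS error between two equal-length outputs (real or complex values)."""
    if len(faulty) != len(golden):
        raise ValueError("outputs must have the same length")
    if not golden:
        return 0.0
    squared = sum(abs(f - g) ** 2 for f, g in zip(faulty, golden))
    return math.sqrt(squared / len(golden))

def misplaced_elements(faulty, golden):
    """Count positions whose element differs from the golden output.

    Assumed definition: a length mismatch counts as extra misplaced elements.
    """
    differing = sum(1 for f, g in zip(faulty, golden) if f != g)
    return differing + abs(len(faulty) - len(golden))

def mean_abs_error(faulty_samples, golden_samples):
    """Absolute difference between the sample means of two runs."""
    faulty_mean = sum(faulty_samples) / len(faulty_samples)
    golden_mean = sum(golden_samples) / len(golden_samples)
    return abs(faulty_mean - golden_mean)
```

For all three functions a return value of 0 means the faulty output is indistinguishable from the golden output under that metric, which matches the "lower is better" convention above.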

File Download

The Big Data Kernels come in two versions: one with fault injection and one without.

For the version with fault injection, the kernels need to be compiled and instrumented through LLVM in order to enable fault injection experiments:

For the version without fault injection, the kernels may be compiled natively; no instrumentation is required.

Eight Big Data Kernels (without fault injection)

For both versions, the following input generator is provided for conveniently scaling up inputs in BFS, MD5 and Grep kernels:

Input Generator for BFS, MD5 and Grep

The next sections explain how to compile the fault injection framework, how to compile the kernels, and how to use the input generator.

If you use this set of Big Data kernels in your work, please kindly cite [1]. Thanks for your interest!

Building the Fault Injector

The ``Fault Injector'' consists of LLVM 3.2, the fault injection binary-transform LLVM pass (derived from the KULFI fault injector from the University of Utah), and a run-time library. The run-time library is located in the sources of the Big Data kernels and is compiled into the kernel binaries by the Makefiles there, so it does not need to be compiled together with the fault injection pass.
This section discusses the compilation of LLVM 3.2 and the fault injection pass.

The compilation process assumes the user has GCC version 5 (g++-5). If g++-5 is not available, line 3 of the file Makefile_KULFI needs to be modified to match the available g++ version. For example, if only g++-4.8 is present, that line should be changed to:

CLANGXX=${LLVM_INSTALL}/bin/clang++ -I/usr/include/x86_64-linux-gnu/c++/4.8 -I. -I../.. -std=c++11

1. LLVM 3.2

LLVM 3.2 is used for best compatibility (newer versions such as 3.8 should work but may take longer to build, may use CMake rather than GNU make, and may require changing include paths in the injector pass).

An out-of-source build is recommended, which means three directories are involved in the build procedure: the source directory, the build directory and the installation directory. Assume they are set as follows:

Source Directory:

~/llvm-3.2-src/

Build Directory:

~/llvm-3.2-build/

Installation Directory:

~/llvm-3.2-install/

The build procedure is as follows.

First, extract the LLVM source into the Source Directory (so that the directory becomes the root of the source tree). Then create the Build Directory and run the configuration step from there, setting the Installation Directory as the prefix.

$ cd ~ && mkdir llvm-3.2-build
$ cd llvm-3.2-build

$ ../llvm-3.2-src/configure --prefix=$HOME/llvm-3.2-install
$ make -j4

The above 4-threaded build (specified using the -j4 flag) takes about 20 minutes to complete on an i5-3210M.

When the build completes, do an ``install'':

$ make install

You will find the binaries in the installation path. The tools built include clang++, opt, llvm-link and llc, which will be used for compiling the Big Data Kernels:

clang++ is the C/C++ frontend of LLVM.

$ ~/llvm-3.2-install/bin/clang++ --version
clang version 3.2 (tags/RELEASE_32/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix

opt is the LLVM optimizer. Its instrumentation capability, which normally serves optimization, is used for fault injection in this project.

$ ~/llvm-3.2-install/bin/opt --version
LLVM (http://llvm.org/):
  LLVM version 3.2svn
  Optimized build with assertions.
  Built May 15 2017 (12:09:18).
  Default target: x86_64-unknown-linux-gnu
  Host CPU: core-avx-i

llvm-link is the LLVM linker, which merges several LLVM bitcode files into one.

$ ~/llvm-3.2-install/bin/llvm-link --version
LLVM (http://llvm.org/):
  LLVM version 3.2svn
  Optimized build with assertions.
  Built May 15 2017 (12:09:18).
  Default target: x86_64-unknown-linux-gnu
  Host CPU: core-avx-i

llc is the LLVM static compiler, which translates LLVM bitcode into assembly for a target architecture.

$ ~/llvm-3.2-install/bin/llc --version
LLVM (http://llvm.org/):
  LLVM version 3.2svn
  Optimized build with assertions.
  Built May 15 2017 (12:09:18).
  Default target: x86_64-unknown-linux-gnu
  Host CPU: core-avx-i

  Registered Targets:
    arm      - ARM
    cellspu  - STI CBEA Cell SPU [experimental]
    cpp      - C++ backend
    hexagon  - Hexagon
    mblaze   - MBlaze
    mips     - Mips
    mips64   - Mips64 [experimental]
    mips64el - Mips64el [experimental]
    mipsel   - Mipsel
    msp430   - MSP430 [experimental]
    nvptx    - NVIDIA PTX 32-bit
    nvptx64  - NVIDIA PTX 64-bit
    ppc32    - PowerPC 32
    ppc64    - PowerPC 64
    sparc    - Sparc
    sparcv9  - Sparc V9
    thumb    - Thumb
    x86      - 32-bit X86: Pentium-Pro and above
    x86-64   - 64-bit X86: EM64T and AMD64
    xcore    - XCore

The above messages suggest that LLVM-3.2 has been successfully built.

2. Injector Pass

A ``pass'', according to the LLVM documentation, is a procedure in the compiler that takes a bitcode file as input, performs transformations, optimizations and/or analyses on it, and produces a bitcode file, or some analysis results, as output.

The particular fault injection pass we use here is a transformation pass: it takes a linked program bitcode file as input and produces an instrumented bitcode file as output. The instrumented bitcode has codepaths that allow bit-flip errors to be introduced into the results of certain types of LLVM instructions.

Once LLVM 3.2 has been built, the Build Directory contains a ``lib/Transforms'' directory (created during the LLVM 3.2 build). To make the LLVM build system aware of the new fault injection pass, add a subfolder under lib/Transforms (this example uses lib/Transforms/faults2) if it does not already exist.

The procedure is identical to the preparatory steps before creating a new LLVM pass.

$ cd ~/llvm-3.2-build
$ cd lib/Transforms
$ ls
Hello  InstCombine  Instrumentation  IPO  Makefile  Scalar  Utils  Vectorize
$ cp -r Hello faults2
$ ls faults2
Makefile  Release+Asserts

Then, edit faults2/Makefile so that it reads as follows. Make sure to change LIBRARYNAME to faults2 and to remove the line and the if statements that set EXPORTED_SYMBOL_FILE.

##===- lib/Transforms/Hello/Makefile -----------------------*- Makefile -*-===##
#
#                     The LLVM Compiler Infrastructure
#
# This file is distributed under the University of Illinois Open Source
# License. See LICENSE.TXT for details.
#
##===----------------------------------------------------------------------===##

LEVEL = ../../..
LIBRARYNAME = faults2
LOADABLE_MODULE = 1
USEDLIBS =

include $(LEVEL)/Makefile.common

Then build and install the fault injector module, similar to building and installing LLVM itself:

$ cd faults2
$ make && make install
llvm[0]: Linking Release+Asserts Loadable Module faults2.so
llvm[0]: Installing Release+Asserts Shared Library /(your home directory)/llvm-3.2-installed//lib/faults2.so

After this step, you have the fault injecting compiler infrastructure set up and are ready for the next steps.

Building and Test-Running the Big Data Kernels

There are three variants for each of the kernels:

Non-fault-injected, non-fault-tolerant
Fault-injected, non-fault-tolerant
Fault-injected, fault-tolerant

The build procedure is largely the same for all variants, but the input formats/parameters and the outputs may differ slightly.
Specifically, the non-fault-injected variant will not show fault-injection-specific outputs.
The table below shows the outputs from the two fault-injected variants as examples.
Choose one kernel from the list below to view the build procedure and the expected output for the test runs.

Input Generation

The following shows the steps to generate inputs that take about 1 minute to run on an i5-3210M processor.
Depending on the kernel, inputs may be generated with a script or with the Graph Generator / Text Generator of the Big Data Generator Suite in BigDataBench (the version modified for compatibility is included in the Download section).
The parameters may be changed for larger/smaller inputs.

Fault Injection

The fault injector is designed to inject a single bit flip error into one dynamic instruction (also referred to as a "fault site") in a program run.
To perform fault injection, set the environment variable ``NEXT_FAULT_COUNTDOWN'' and run the instrumented binary.
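The countdown mechanism can be pictured with a toy Python model. This is only a conceptual sketch, not the actual run-time library: the class FaultInjector, its visit method and the instrumented_sum function are hypothetical names, and the exact off-by-one behavior of the real countdown is an assumption.

```python
class FaultInjector:
    """Toy stand-in for the fault injection run-time (hypothetical API)."""

    def __init__(self, countdown, bit):
        self.countdown = countdown   # dynamic fault sites to skip before injecting
        self.bit = bit               # zero-based bit position to flip
        self.sites_seen = 0          # dynamic fault sites enumerated so far
        self.injected_at = None      # value of sites_seen when the fault was injected

    def visit(self, value):
        """Called on the result of every instrumented instruction."""
        self.sites_seen += 1
        if self.injected_at is None and self.countdown == 0:
            self.injected_at = self.sites_seen
            return value ^ (1 << self.bit)   # inject exactly one bit flip
        self.countdown -= 1
        return value

def instrumented_sum(values, injector):
    """A tiny 'kernel' whose additions are all routed through the injector."""
    total = 0
    for v in values:
        total = injector.visit(total + v)
    return total

# Fault-free reference: the countdown is never reached.
golden = instrumented_sum([1, 2, 3, 4], FaultInjector(countdown=10**9, bit=0))

# Faulty run: the second dynamic instruction gets bit 3 flipped.
injector = FaultInjector(countdown=1, bit=3)
faulty = instrumented_sum([1, 2, 3, 4], injector)
print(golden, faulty, injector.injected_at)
```

In this model a single corrupted intermediate value propagates to the final result, which is the effect the kernel-specific error metrics are meant to quantify.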

For example, this command sets a "countdown" of 2585333 to the dynamic instruction that is to be fault-injected:

NEXT_FAULT_COUNTDOWN=2585333 ./qsort.bin

When executed, the command produces output that consists of three parts:

$ NEXT_FAULT_COUNTDOWN=2585333 ./qsort.bin 
[Fault Injection Campaign details]
   Max interval: 10000000
Reading configuration from environment variables.
   Next fault CountDown = 2585333
   Should initialize randseed = 0
   Bit position for faults=-1
   Dump BB Trace=0

Part 1 of the output simply shows the fault injection campaign configuration (only the countdown to the next fault is used; the fault injection run-time has some other features that are not used in this project).

/*********************************Start**************************************/
Succeffully injected 32-bit Int Data Error!!
Total # faults injected : 1
Bit position is: 7
Index of the fault site : 2856
Total # of fault sites enumerated: 2585334
/*********************************End**************************************/

Part 2 of the output is printed when a fault is actually injected into the program, which includes some details about the current fault-injected run:
- The total # faults line suggests 1 bit-flip error has been introduced so far, which is the error that's been just injected.
- The bit position line suggests the 8th bit (the position is zero-based) has been flipped, which is equivalent to XOR'ing the output of the fault injected instruction with (1<<7).
- The index of the fault site line indicates the ID of the static fault site, namely, the static LLVM instruction.
- The total # of fault sites enumerated line indicates that a total of 2585334 fault sites have been enumerated. (The number may be slightly different from the specified Countdown.)
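The XOR equivalence mentioned above can be checked with a short Python sketch. The helper name flip_bit and the value 1000 are illustrative assumptions; the sketch models a 32-bit unsigned result only.

```python
def flip_bit(value, bit, width=32):
    """Flip one bit of an unsigned `width`-bit integer by XOR-ing with a one-hot mask."""
    if not 0 <= bit < width:
        raise ValueError("bit position out of range")
    word_mask = (1 << width) - 1
    return (value ^ (1 << bit)) & word_mask

original = 1000                      # illustrative instruction result
corrupted = flip_bit(original, 7)    # same as original ^ (1 << 7)
print(original, corrupted)           # prints "1000 872"
```

Because XOR with the same mask is its own inverse, applying flip_bit twice at the same position restores the original value.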

FAULTINESS: 85
/*********************Fault Injection Statistics****************************/
Total # fault sites enumerated : 4787069
/*********************************End**************************************/

The last part of the output contains the rest of the program output as well as the fault injection summary.
- "FAULTINESS: 85" is the output from the program-specific error check. In Quicksort, this means 85 elements in the output are located at incorrect places.
- The last line prints out the number of dynamic fault sites that have been enumerated in this fault-injected run. Because a fault injection may cause the control flow of a program to change, this number may be different from that of a non-fault-injected run, or another fault-injected run.
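When many runs are collected, the fields described above can be extracted from the captured output. The following Python sketch is one possible way to do so; the function name parse_run_output and the returned dictionary keys are hypothetical, and the patterns are based only on the Quicksort output shown above, so other kernels may need different patterns.

```python
import re

def _first_int(pattern, text):
    """Return the first integer captured by `pattern`, or None if it is absent."""
    match = re.search(pattern, text, re.MULTILINE)
    return int(match.group(1)) if match else None

def parse_run_output(text):
    """Extract fault injection details from the captured output of one run."""
    totals = re.findall(
        r"^[ \t]*Total # (?:of )?fault sites enumerated\s*:\s*(\d+)",
        text, re.MULTILINE)
    faultiness = re.search(r"^[ \t]*FAULTINESS:\s*(\S+)", text, re.MULTILINE)
    bit = _first_int(r"^[ \t]*Bit position is:\s*(\d+)", text)
    return {
        "injected": bit is not None,   # Part 2 only appears when a fault was injected
        "bit_position": bit,
        "static_site": _first_int(r"^[ \t]*Index of the fault site\s*:\s*(\d+)", text),
        "faultiness": float(faultiness.group(1)) if faultiness else None,
        # The last "fault sites enumerated" line is the end-of-run summary.
        "total_sites": int(totals[-1]) if totals else None,
    }
```

A run in which the countdown is never reached prints no Part 2, so "injected" is False and the fault-specific fields are None.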

The size of the entire fault injection space could be prohibitively large, so usually a subset of the space is sampled. Details of the process of sampling and making sure the sample size is large enough to be representative of the characteristics of the kernel in question are discussed in [1].
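As a rough illustration of sampling the fault injection space, the Python sketch below draws countdown values uniformly at random and builds the corresponding command lines. Uniform sampling without replacement, the zero-based countdown range and the function names are all assumptions made for this sketch; the sampling methodology actually used is the one described in [1].

```python
import random

def sample_countdowns(total_sites, num_runs, seed=0):
    """Pick distinct countdown values uniformly from [0, total_sites)."""
    rng = random.Random(seed)                 # fixed seed for repeatable campaigns
    num_runs = min(num_runs, total_sites)     # cannot draw more distinct values than exist
    return sorted(rng.sample(range(total_sites), num_runs))

def build_commands(binary, countdowns):
    """Build one shell command line per sampled countdown value."""
    return [f"NEXT_FAULT_COUNTDOWN={c} {binary}" for c in countdowns]

# total_sites would come from the "Total # fault sites enumerated" line of a run;
# the value below is only a placeholder.
for command in build_commands("./qsort.bin", sample_countdowns(1000000, 3)):
    print(command)
```

Each printed line has the same form as the example command above, with one sampled countdown per run so that every run receives a single bit flip.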

References

Please kindly cite the following paper as your reference if you are using the kernels for your work.

[1] LeCompte, T., Legrand, W., Chen, S., Peng, L. "Soft error resilience of Big Data kernels through algorithmic approaches", J Supercomput (2017). doi:10.1007/s11227-017-2042-6