Architecture Independent Fault Tolerance
In Dome, we supply checkpoint/restart functions that are written in the
source language (C++) itself, which makes them easily portable across
architectures. Traditional fault-tolerance routines save core memory
images; Dome instead saves only the values it needs from the higher-level
data structures. A preprocessor modifies programs so that the stack and
program counter can be restored with little extra work for the user.
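To illustrate the difference between saving a memory image and saving
values, here is a minimal sketch. It is not Dome's code: save_values,
load_values, and round_trip are hypothetical helpers, and the text format is
an assumption made for the example. The point is that only the logical
contents of a data structure are written, in a form that does not depend on
byte order or pointer size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical helper (not part of Dome): write the element count and then
// each value as decimal text, one per line. No raw memory is copied.
void save_values(std::ostream& out, const std::vector<std::int64_t>& v) {
    out << v.size() << '\n';
    for (std::int64_t x : v) out << x << '\n';
}

// Hypothetical helper (not part of Dome): read back what save_values wrote.
// Returns false and leaves v untouched if the input is truncated or malformed.
bool load_values(std::istream& in, std::vector<std::int64_t>& v) {
    std::size_t n = 0;
    if (!(in >> n)) return false;
    std::vector<std::int64_t> tmp;
    for (std::size_t i = 0; i < n; ++i) {
        std::int64_t x = 0;
        if (!(in >> x)) return false;
        tmp.push_back(x);
    }
    v.swap(tmp);
    return true;
}

// Save to an in-memory buffer and load again, as a checkpoint followed by a
// restart would do through a file.
std::vector<std::int64_t> round_trip(const std::vector<std::int64_t>& v) {
    std::stringstream buf;
    save_values(buf, v);
    std::vector<std::int64_t> restored;
    bool ok = load_values(buf, restored);
    assert(ok);
    (void)ok;  // silence the unused-variable warning when NDEBUG is defined
    return restored;
}
```

Because the checkpoint holds values rather than a memory image, a file
written this way on one machine can in principle be read on a machine with a
different word size or byte order, and it is no larger than the data itself
requires.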
Advantages of Our Method
- Architecture independence
- Small checkpoints
- Simple, easily understandable implementation
- Checkpoints may be used to provide both fault tolerance and
process migration.
How It Works
- dome::checkpoint() checkpoints all objects in the Dome environment.
- User code should call dome::checkpoint() once per iteration.
- On restart, dome::restart() restores the Dome variables.
- A preprocessor modifies programs to save the stack and program
counter.
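The steps above can be sketched as a small self-contained program fragment.
This is an illustration under stated assumptions, not Dome's interface: the
sketch namespace, the State struct, and the signatures of checkpoint,
restart, and run are all hypothetical, and a saved loop index stands in for
the stack and program counter that the preprocessor would handle.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

namespace sketch {  // hypothetical stand-in, not Dome's real interface

struct State {
    int next_iter = 0;      // where to resume; plays the role of the saved position
    std::int64_t sum = 0;   // application data carried across iterations
};

// Serialize the state as text so the checkpoint is architecture independent.
std::string checkpoint(const State& s) {
    std::ostringstream out;
    out << s.next_iter << ' ' << s.sum;
    return out.str();
}

// Restore the state from a checkpoint. Returns false and leaves s untouched
// if the checkpoint is empty or malformed.
bool restart(const std::string& saved, State& s) {
    std::istringstream in(saved);
    State tmp;
    if (!(in >> tmp.next_iter >> tmp.sum)) return false;
    s = tmp;
    return true;
}

// Sum the integers 0..total-1, checkpointing into `store` after every
// iteration. A simulated failure occurs just before iteration `fail_at`
// (pass -1 for no failure); in that case -1 is returned.
std::int64_t run(int total, int fail_at, std::string& store) {
    assert(total >= 0);
    State s;
    restart(store, s);  // with no usable checkpoint, start from the defaults
    for (int i = s.next_iter; i < total; ++i) {
        if (i == fail_at) return -1;  // simulated crash
        s.sum += i;
        s.next_iter = i + 1;
        store = checkpoint(s);        // one checkpoint per iteration
    }
    return s.sum;
}

}  // namespace sketch
```

Calling sketch::run(10, 4, store) with an empty store stops at the simulated
failure and leaves a checkpoint behind; calling sketch::run(10, -1, store)
afterwards resumes at iteration 4 instead of starting over and returns the
same result as an uninterrupted run.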
Status
- Implementation is currently in progress. See our latest tech
report, linked from the Dome home page.