Hypervisor-based Fault Tolerance
T. Bressoud and F. Schneider

This paper presents an alternative to the usual methods of providing fault-tolerant systems: special hardware, special operating system support, or fault tolerance written into the application. The problem with hardware-based fault tolerance is that hardware manufacturers generally place less emphasis on fault-tolerant processors, as the market is small, so the speed of available fault-tolerant systems lags quite a bit behind that of general-purpose hardware. The problem with operating system support is similar: operating systems are very complex, designing and programming them to handle redundant hardware is difficult at best, and the work has to be redone for every operating system. Building fault tolerance into the application requires either that every program deal explicitly with faults, or that a library be written which knows how to do so and that all programmers be funneled through it.

Instead, Bressoud and Schneider propose implementing, on top of the hardware, a virtual machine that knows how to provide fault tolerance. If the VM presents to the operating system a machine that behaves exactly like the underlying hardware, then any operating system that runs on the original machine also runs on the VM. Moreover, writing a VM for new hardware is far more feasible than getting hardware companies to build fault-tolerant processors. The virtual machine monitor they wrote is called Hypervisor. They (and this summary) mostly discuss fault tolerance with two processors, a primary and its backup, but the system they describe extends easily to multiple backups. Because of the nature of their system, only one virtual machine needs to run at a time (in contrast to IBM's system, which allowed several), which simplified their efforts considerably.

For their system to work, they make several assumptions:

I/O Device Accessibility Assumption. I/O devices are shared between virtual machines. Thus the disks, the keyboard, the monitor, etc. can be controlled by any processor.

Virtual Machine Assumption. Virtual machines correctly imitate the hardware. This is fairly self-explanatory; if the VMs differ from the hardware, the OS may not work.

Ordinary Instruction Assumption. An ordinary instruction is one that depends only on the virtual machine state (which includes memory, the contents of registers, etc.), as opposed to an environment instruction, which depends on things outside that state (the time-of-day clock, the contents of I/O registers, etc.). The assumption states that executing the same ordinary instruction on two processors in the same virtual machine state has exactly the same effect.

Environment Instruction Assumption. During environment instructions, the virtual machines have a chance to communicate. This allows the system to ensure that whatever information was given to the primary is also given to the backups.

Instruction-Stream Interrupt Assumption. A mechanism is available to invoke the hypervisor when a specified point in the instruction stream is reached. The processor they used had a counter that decremented once per instruction and raised an interrupt when it reached zero; other options include periodic explicit calls into the hypervisor. This assumption allows the introduction of a concept called an `epoch.'

An epoch is the portion of execution between hypervisor invocations (those not resulting from environment instructions). Epochs must be identical across all virtual machines for the system to work. Given these definitions, they designed a fault-tolerant system using only two basic tricks: interrupts are delivered at epoch boundaries, and the results of environment instructions are passed from the primary to the backup. Synchronization occurs at epoch boundaries, with the backup always running one epoch behind the primary. Thus, at the beginning of epoch E, the backup has the results of all environment instructions it will execute and all of the interrupts that were delivered to the primary when it started epoch E, so it is able to execute the epoch EXACTLY as the primary did. The primary ensures that after finishing epoch E it does not continue until the backup has received all of the messages from that epoch, at which point it sends an "[end, E]" message telling the backup that it can start epoch E, and then the primary starts epoch E+1. If the primary fails during epoch E, the backup never receives an "[end, E]" message; after a timeout it recognizes that the primary has failed and promotes itself to primary, and, since the primary failed partway through epoch E, the backup no longer has any obligation to behave as the primary did. A toy sketch of this protocol appears below.
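To make the protocol concrete, here is a toy single-process sketch in Python. Chan, EPOCH_LEN, and the stand-in "instructions" are illustrative names of this summary's choosing, not the paper's code; the point is that logging environment-instruction results (and replaying them at the same point in the stream) keeps the two virtual machine states identical.

    import queue, random

    class Chan:
        """Reliable FIFO channel from primary to backup (in-process stand-in)."""
        def __init__(self):
            self.q = queue.Queue()
        def send(self, m):
            self.q.put(m)
        def recv(self, timeout=None):
            try:
                return self.q.get(timeout=timeout)
            except queue.Empty:
                return None                # timeout: assume the primary failed

    EPOCH_LEN = 5                          # "instructions" per epoch (toy value)

    def primary_epoch(e, state, chan):
        for _ in range(EPOCH_LEN):         # ordinary instructions: deterministic
            state["x"] += 1
        r = random.random()                # environment instruction: result is
        state["x"] += r                    # not reproducible, so log it
        chan.send(("env", e, r))
        # Real protocol: also wait here until the backup acks epoch e's messages.
        chan.send(("end", e))              # the backup may now start epoch e

    def backup_epoch(e, state, chan, timeout=1.0):
        log = []
        while True:
            msg = chan.recv(timeout)
            if msg is None:
                raise RuntimeError("no [end] message: promote backup to primary")
            if msg[0] == "end":
                break
            log.append(msg[2])
        for _ in range(EPOCH_LEN):         # re-execute ordinary instructions
            state["x"] += 1
        state["x"] += log[0]               # replay the logged environment result

    chan, p, b = Chan(), {"x": 0.0}, {"x": 0.0}
    for e in range(3):
        primary_epoch(e, p, chan)
        backup_epoch(e, b, chan)           # runs one epoch behind the primary
    assert p == b                          # identical state after every epoch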
I/O instruction problem

The major problem with the system as described is lost interrupts upon failure. The backup ignores all interrupts it receives directly, secure in the knowledge that the primary will tell it about anything important: since the primary is executing the same instruction stream, it should receive the same interrupts. However, interrupts received by the primary between its failure and the end of the epoch are lost, and must somehow be re-issued. Hypervisor assumes that all I/O devices can issue an `uncertain interrupt,' telling the processor that the I/O may or may not have been performed. Hypervisor keeps track of all pending I/O operations and, upon failure, issues uncertain interrupts for them. Since the operating system must be able to handle uncertain interrupts anyway, this method requires no changes to the operating system. A sketch of this failover path appears below.
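A rough sketch of that failover path, with hypothetical names (PendingIO, deliver); the paper describes this bookkeeping abstractly, not as code:

    class PendingIO:
        """Hypervisor-side record of I/O the primary issued but has not
        yet been seen to complete."""
        def __init__(self):
            self.pending = {}                  # request id -> operation

        def io_issued(self, req_id, op):
            self.pending[req_id] = op

        def io_completed(self, req_id):
            del self.pending[req_id]           # normal completion interrupt seen

        def promote_backup(self, deliver):
            # Interrupts delivered to the primary between its failure and the
            # end of the epoch are lost. For each operation still pending,
            # deliver an uncertain interrupt: "this I/O may or may not have
            # been performed." The OS must handle these anyway.
            for req_id, op in self.pending.items():
                deliver(req_id, op, "uncertain")
            self.pending.clear()

    hv = PendingIO()
    hv.io_issued(1, "write block 7")
    hv.io_issued(2, "read block 9")
    hv.io_completed(1)                         # the write finished normally
    hv.promote_backup(lambda rid, op, s: print(rid, op, s))
    # prints: 2 read block 9 uncertain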
Performance

Within a reasonable band, the longer the epoch, the better the performance. They typically looked at epoch lengths no longer than 35,000 instructions. At small epoch lengths, the processor spends the majority of its time synchronizing the two machines; at extremely long epoch lengths, the additional latency induced by the interrupt delivery method starts to become an issue. Moreover, at the longer epoch lengths, the speed of the communication link (10 Mbps vs. 155 Mbps) was not much of an issue. Epoch length had a much larger effect on CPU-intensive workloads; I/O-intensive workloads are bounded by disk accesses, so the cost of keeping the two systems synchronized is hidden. Overall, the system was 1.5 to 2 times slower than running without fault tolerance.

Post Script

They discovered that the limiting factor in their original system was the requirement that the primary wait, at the end of each epoch, until the backup had acknowledged all of that epoch's messages. They relaxed this constraint, requiring only that the backup have acknowledged all outstanding messages before the primary issues its next I/O operation. The justification is that I/O instructions are the only way the state of the virtual machine is revealed to its I/O devices. This alteration helped immensely on CPU-intensive workloads, increasing speed by 50% (i.e., taking 33% less time), but, not surprisingly, was much less helpful on I/O-intensive workloads. The performance figures above are for this revised system. A sketch of the relaxed rule appears below.
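A sketch of the relaxed rule, again with hypothetical names; the only change from the original protocol is where the primary blocks:

    class BackupStub:
        """Stand-in for the link to the backup; real code blocks on the network."""
        def wait_for_ack(self, epoch):
            pass

    class PrimarySync:
        # Original rule: at the end of each epoch, block until the backup
        # acknowledges that epoch's messages.
        # Relaxed rule: keep executing, and block only just before an I/O
        # instruction, since I/O is the only point where virtual machine
        # state becomes visible outside the machine.
        def __init__(self, backup):
            self.backup = backup
            self.unacked = []                  # epochs not yet acknowledged

        def end_epoch(self, e):
            self.unacked.append(e)             # relaxed: do NOT block here

        def before_io(self):
            for e in self.unacked:             # block here instead
                self.backup.wait_for_ack(e)
            self.unacked.clear()

    sync = PrimarySync(BackupStub())
    sync.end_epoch(0)
    sync.end_epoch(1)                          # a CPU-bound stretch: no waiting
    sync.before_io()                           # drain all acks before the I/O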