The Manetho Project
Welcome to the Manetho WWW page. The Manetho project started
at Rice University in 1989 as a prototype system to investigate the issues
of providing reliability in network multicomputers.
The prototype implementation was completed in 1993.
For information, contact Mootaz Elnozahy.
Publications
-
"A Survey of Rollback-Recovery Protocols in Message-Passing Systems,"
E.N. Elnozahy, D.B. Johnson and Y.M. Wang.
Technical report CMU-CS-96-181, Department of Computer Science,
Carnegie Mellon University, September 1996.
-
"On the Use and
Implementation of Message Logging," E.N. Elnozahy and W. Zwaenepoel.
In Proceedings of the Twenty-Fourth International Symposium on
Fault-Tolerant Computing (FTCS-24), pp. 298-307, June 1994.
-
"Manetho: Fault Tolerance in Distributed Systems Using Rollback-Recovery and
Process Replication."
E.N. Elnozahy. PhD thesis, Department of Computer Science, Rice University,
October 1993. Also available as technical report TR93-212.
-
"The Performance of Consistent Checkpointing,"
E.N. Elnozahy, D.B. Johnson and W. Zwaenepoel.
In Proceedings of the Eleventh Symposium on Reliable Distributed Systems,
pp. 39-47, October 1992.
-
"Replicated Distributed Processes in Manetho,"
E.N. Elnozahy and W. Zwaenepoel.
In Proceedings of the Twenty-Second International Symposium on Fault-Tolerant
Computing (FTCS-22), pp. 18-27, July 1992.
-
"Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast
Output Commit,"
E.N. Elnozahy and W. Zwaenepoel.
In IEEE Transactions on Computers, Special Issue on Fault-Tolerant
Computing, pp. 526-531, May 1992.
Description
The Manetho system addresses the problem of
providing low-overhead fault tolerance
in distributed systems, with emphasis on high performance during
failure-free operation.
This problem is especially prominent in network multicomputing,
where a number of powerful workstations connected by a high-speed
network offer the processing capacity to run compute-intensive
applications. In such an environment,
it is important to provide fault tolerance without affecting
the failure-free performance of the system.
Manetho provides application-transparent fault tolerance
to long-running distributed computations.
It is based on maintaining an antecedence graph, a technique that allows
rollback-recovery to coexist in the same system with active process
replication. Thus, inexpensive rollback-recovery is used for client
processes, while active process replication is used for server processes
where high availability is required. Combining rollback-recovery and
process replication lets the system accommodate different application
requirements. This differs from previous work, in which a single method
is used to provide fault tolerance for all processes.
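To make the idea of an antecedence graph concrete, the following is a minimal sketch in Python, not Manetho's actual implementation. It assumes that each message receipt starts a new state interval, that the graph maps each interval to its direct antecedents, and that a sender piggybacks its whole graph on every message (a real system would presumably send only the parts the receiver is missing). The names Process, send, and receive are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

# Hypothetical representation: an event is (process id, state interval index).
Event = Tuple[str, int]


@dataclass
class Process:
    pid: str
    interval: int = 0
    # Maps each known event to the set of events that directly precede it.
    graph: Dict[Event, Set[Event]] = field(default_factory=dict)

    def __post_init__(self):
        # The initial state interval has no antecedents.
        self.graph[(self.pid, 0)] = set()

    def current(self) -> Event:
        return (self.pid, self.interval)

    def send(self):
        """Return the sender's current event and a copy of its graph."""
        return self.current(), {e: set(p) for e, p in self.graph.items()}

    def receive(self, message):
        """Merge the piggybacked graph and record the receipt as a new event."""
        sender_event, sender_graph = message
        for event, preds in sender_graph.items():
            self.graph.setdefault(event, set()).update(preds)
        previous = self.current()
        self.interval += 1
        # The new interval follows the local previous interval and the send.
        self.graph[self.current()] = {previous, sender_event}


p, q, r = Process("p"), Process("q"), Process("r")
q.receive(p.send())  # q learns about p's history
r.receive(q.send())  # r learns about p's and q's history transitively
```

In this sketch, r ends up holding a record of the receipt at q even though r never communicated with p directly, which is the property that lets a surviving process supply history on behalf of a failed one.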
Key Contributions
Manetho has been implemented on an Ethernet
network that connects 16 workstations
running the V system. The main contributions and results of the
thesis are:
-
A protocol for message logging that combines the advantages of
optimistic and pessimistic message logging systems while avoiding their
disadvantages.
As in optimistic systems, failure-free overhead is small:
running time increases by 1.5% on average for the distributed
applications studied.
As in pessimistic systems, rollback after a failure is
limited to the processes that fail.
-
A multicast protocol that supports active process replication. The
protocol eliminates the high message-delivery latency that is common in
negative-acknowledgment multicast protocols, while maintaining high
throughput.
Measurements show that its multicast latency is 30% lower than that of
the fastest known multicast protocol that runs on the same hardware and
uses a similar kernel.
-
An implementation that shows that consistent distributed checkpointing can
be provided with very low overhead, contrary to what is widely believed.
With a short checkpointing interval of two minutes, the failure-free
running time of the applications studied increases by only 1% on average.
The measurements also show that the overhead is dominated by the cost of
accessing stable storage, not by the number of messages used to
coordinate the checkpoints, contrary to previous claims.
-
An efficient solution to the
problem of output latency in rollback-recovery systems.
Sending a message to the outside world does not require
multihost coordination. As a result, Manetho's output latency
is often an order of magnitude lower than that of other systems.
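The output-latency point in the last item can be illustrated with a small sketch. It assumes, as an illustration rather than a description of Manetho's code, that a process commits output by forcing its own antecedence graph to local stable storage before releasing the message, so that no other host has to take part. The function commit_output, the JSON log format, and the emit callback are hypothetical.

```python
import json
import os


def commit_output(graph, log_path, output, emit):
    """Force the local graph to stable storage, then release the output.

    graph maps an event (pid, interval) to the set of its direct antecedents.
    Only local I/O happens here; no other process is contacted.
    """
    record = [[list(event), sorted(list(p) for p in preds)]
              for event, preds in sorted(graph.items())]
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
        log.flush()
        os.fsync(log.fileno())  # the record must be durable before output leaves
    emit(output)
```

The design choice shown is the ordering: the write to stable storage completes first, and only then is the output handed to the outside world.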
The Name
The old Egyptian civilization had no exact system of chronology. The
priests
usually dated events according to the years of a king's reign.
For this purpose, several lists of kings were maintained
at the various temples throughout Egypt. Some of these
lists survived the decline of the old
Egyptian empire.
The priest
Manetho (Ma-Net-Ho) lived
during the
reign of the Ptolemies, circa 300 B.C.,
at the center of the Nile
Delta in Sebennytus, a place now called Samannud.
He collected the lists
that were preserved in the various temples and used them to
write the history
of Egypt in a three-volume book. This
book remained the authentic source for Egypt's history for
several centuries until it was lost, probably during the
fire of the library of Alexandria (circa 390 A.D.).
The operation of the system
metaphorically resembles
what Manetho did to restore
the history of Egypt.
Like old Egypt, the system
does not have access to an exact, global time service.
Each individual process
maintains information about its perception of the system's execution
history, similar to what the priests of old Egypt did.
If a failure occurs, a protocol collects the fragments
of the system's execution history from the individual processes,
and like Manetho, restores the full history of the system. This
history is used to recover from the failure.
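The recovery step described above can be sketched as follows. This is an illustration under assumptions, not Manetho's recovery protocol: it assumes each surviving process contributes a fragment that maps an event to its direct antecedents, and it uses Python's standard graphlib module to produce one replay order that respects those antecedents. The function restore_history and the event naming are hypothetical.

```python
from graphlib import TopologicalSorter


def restore_history(fragments):
    """Merge per-process fragments and return events in a causally valid order."""
    merged = {}
    for fragment in fragments:
        for event, preds in fragment.items():
            merged.setdefault(event, set()).update(preds)
    # static_order() yields every event after all of its antecedents.
    return list(TopologicalSorter(merged).static_order())


# Fragments held by two surviving processes after q has failed.
from_p = {("p", 0): set()}
from_r = {("q", 1): {("q", 0), ("p", 0)}, ("r", 1): {("r", 0), ("q", 1)}}
history = restore_history([from_p, from_r])
```

Here the failed process q contributes nothing, yet its receipt event ("q", 1) still appears in the restored history because r recorded it.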
Mootaz Elnozahy, mootaz@cs.cmu.edu