Distributed System Fault Tolerance Research
Long-running distributed application programs
run the risk of failing completely
if even one of the processors on which
they are executing fails.
My work in this area develops transparent, low-overhead
mechanisms that allow distributed application programs
to continue executing in spite of such failures.
My Ph.D. thesis
studied the theory, implementation, and performance
of transparent rollback recovery methods
based on message logging and checkpointing,
and I continue to be active in the area of
distributed system fault tolerance research.
I have primarily investigated
optimistic rollback recovery methods that combine
message logging and checkpointing.
Message logging and checkpointing
methods can provide fault tolerance with very low overhead,
typically adding less than 1 percent
to the execution time of many distributed application programs.
My research has included the
development of a model for reasoning
about and proving the correctness of these systems,
the design of several new logging and recovery algorithms and techniques,
and the implementation and performance measurement
of pessimistic and optimistic message logging.
I have also worked on transparent rollback recovery based on
consistent checkpointing,
and on recoverable distributed shared memory.
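The general idea behind rollback recovery with message logging and checkpointing, as described above, can be illustrated with a small sketch. This is a toy example and not any of the specific algorithms from the papers listed below. It assumes a single process whose execution is deterministic given the messages it receives, logs each message before processing it (the simplest, synchronous style of logging), and uses in-memory fields as stand-ins for stable storage. The class name LoggedProcess and all of its methods are hypothetical names chosen for illustration.

```python
import copy


class LoggedProcess:
    """Toy deterministic process: its state depends only on received messages."""

    def __init__(self):
        self.state = {"total": 0, "received": 0}
        # Assumption: these two fields stand in for stable storage
        # that survives a failure of the process.
        self.checkpoint = copy.deepcopy(self.state)
        self.log = []  # messages received since the last checkpoint

    def _apply(self, msg):
        # Deterministic state transition driven only by the message.
        self.state["total"] += msg
        self.state["received"] += 1

    def deliver(self, msg):
        # Log the message first, then process it.
        self.log.append(msg)
        self._apply(msg)

    def take_checkpoint(self):
        # Save the current state; earlier messages are no longer
        # needed for recovery, so the log can be discarded.
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def crash(self):
        # Simulate a failure: the volatile state is lost.
        self.state = None

    def recover(self):
        # Roll back to the checkpoint, then replay the logged messages
        # in their original order to rebuild the pre-failure state.
        self.state = copy.deepcopy(self.checkpoint)
        for msg in self.log:
            self._apply(msg)


if __name__ == "__main__":
    p = LoggedProcess()
    p.deliver(5)
    p.deliver(7)
    p.take_checkpoint()
    p.deliver(11)
    before = copy.deepcopy(p.state)
    p.crash()
    p.recover()
    print(p.state == before)
```

Because the process is deterministic, replaying the logged messages from the checkpoint reproduces the state that existed before the failure. A real system must additionally handle multiple communicating processes and keep their recovered states consistent with one another, which this sketch does not attempt.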
Research Papers:
-
Mootaz Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson.
A Survey of Rollback-Recovery Protocols in Message-Passing Systems.
Technical Report CMU-CS-99-148, School of
Computer Science, Carnegie Mellon University, June 1999.
-
Sean W. Smith and David B. Johnson.
Minimizing Timestamp Size for Completely Asynchronous
Optimistic Recovery with Minimal Rollback.
In Proceedings of the 15th IEEE Symposium on Reliable Distributed Systems,
pp. 66-75,
IEEE Computer Society, Niagara-on-the-Lake, Ontario, Canada, October 1996.
-
Sean W. Smith, David B. Johnson, and J. D. Tygar.
Completely Asynchronous Optimistic Recovery with Minimal Rollbacks.
In The 25th Annual International Symposium on
Fault-Tolerant Computing: Digest of Papers,
IEEE Computer Society, Pasadena, CA, June 1995.
-
David B. Johnson.
Efficient Transparent Optimistic Rollback Recovery for
Distributed Application Programs.
In Proceedings of the 12th Symposium on Reliable Distributed Systems,
pp. 86-95, IEEE Computer Society, Princeton, NJ, October 1993.
-
John B. Carter, Alan L. Cox, Sandhya Dwarkadas,
Elmootazbellah N. Elnozahy, David B. Johnson, Pete Keleher, Steven Rodrigues,
Weimin Yu, and Willy Zwaenepoel.
Network Multicomputing Using Recoverable Distributed Shared Memory.
In Digest of Papers: COMPCON Spring 1993,
The Thirty-Eighth IEEE Computer Society International Conference,
pp. 519-527, San Francisco, CA, February 1993.
-
Elmootazbellah Nabil Elnozahy, David B. Johnson, and Willy Zwaenepoel.
The Performance of Consistent Checkpointing.
In Proceedings of the 11th Symposium on Reliable Distributed Systems,
pp. 39-47, IEEE Computer Society, Houston, TX, October 1992.
-
David B. Johnson and Willy Zwaenepoel.
Recovery in Distributed Systems Using Optimistic Message Logging
and Checkpointing.
Journal of Algorithms, 11(3):462-491, September 1990.
Revised version of a paper presented at
the Seventh Annual ACM Symposium
on Principles of Distributed Computing,
Toronto, Ontario, Canada, August 1988.
-
David B. Johnson and Willy Zwaenepoel.
Transparent Optimistic Rollback Recovery.
In Proceedings of the Fourth ACM SIGOPS European Workshop:
Fault Tolerance Support in Distributed Systems, Bologna, Italy,
September 1990.
Reprinted in ACM SIGOPS Operating Systems Review, 25(2):99-102,
April 1991.
-
David B. Johnson, Peter J. Keleher, and Willy Zwaenepoel.
A Simple Algorithm for Finding the Maximum Recoverable System State
in Optimistic Rollback Recovery Methods.
Technical Report Rice COMP TR90-125, Department of
Computer Science, Rice University, July 1990.
-
David B. Johnson and Willy Zwaenepoel.
Output-Driven Distributed Optimistic Message Logging and
Checkpointing.
Technical Report Rice COMP TR90-118, Department of
Computer Science, Rice University, May 1990.
-
David B. Johnson and Willy Zwaenepoel.
Distributed System Fault Tolerance Using Sender-Based Message Logging.
Technical Report Rice COMP TR90-119, Department of
Computer Science, Rice University, May 1990.
-
David B. Johnson.
Distributed System Fault Tolerance Using
Message Logging and Checkpointing.
Ph.D. thesis, Rice University, December 1989.
Also Technical Report Rice COMP TR89-101, Department of
Computer Science, Rice University, December 1989.
-
David B. Johnson and Willy Zwaenepoel.
Sender-Based Message Logging.
In The Seventeenth Annual International Symposium on
Fault-Tolerant Computing: Digest of Papers,
IEEE Computer Society, pp. 14-19, Pittsburgh, PA, July 1987.
David B. Johnson, dbj@cs.cmu.edu.
Last modified February 20, 1996.