Anatidae: A distributed system of highly available objects
This project is a joint effort between CMU and Bellcore. The following
people are the current members of the research team:
Publications
-
"Experiences Using DCE and CORBA to Build Tools for Creating Highly-Available
Distributed Systems," E.N. Elnozahy, V. Ratan, and M.E. Segal.
In Proceedings of the IFIP International Conference on Open Distributed
Processing (ICODP'95), February, 1995.
Also available as technical report
CMU-CS-95-117.
-
"Highly Available Directory Services in DCE,"
B. Acevedo, L. Bahler, E.N. Elnozahy, V. Ratan, and M.E. Segal.
In Proceedings of the Twenty-Sixth International Symposium on
Fault-Tolerant Computing (FTCS-26), June, 1996.
Also available as technical report
CMU-CS-96-???.
Status
- A prototype of a highly available, and correct, DCE CDS service is being
developed.
- Implementation of a prototype of a highly available CORBA-compliant
name server is underway.
Summary
High availability is a major concern for many distributed applications.
A failure in a computer-controlled system could result in problems ranging from
minor inconveniences to its users to a major disaster that leads to
catastrophic loss of human life. Traditional techniques for adding high
availability to these applications typically rely on
nonstandard systems built out of custom software and hardware.
These systems are
expensive and difficult to maintain, and applications written on one
system cannot be easily
ported to other platforms.
We are building a software layer that supports adding fault
tolerance and high availability to distributed applications written on
standard workstations
and operating systems.
The proposed software layer
allows the creation of highly-available
objects that are portable across a variety of heterogeneous hardware
platforms.
Our
techniques are built on the CORBA and DCE standards of distributed computing,
ensuring
interoperability with other systems that conform to the
same standards, and enhancing the
commercial viability of our approach.
Along with these goals, the proposed system allows
legacy applications that were written for custom hardware systems to
migrate to standard
platforms without major software redesign.
The major contributions and innovations that we
claim in this work are:
-
application-transparent fault tolerance.
Our techniques transparently add fault tolerance to
distributed objects with no programmer intervention.
This approach reduces
the complexity of writing distributed applications by relieving
the programmer from
the error-prone and mundane tasks of handling failure and
recovery at the application
level.
The proposed techniques also transparently retrofit legacy applications that
were
designed without consideration for fault tolerance or were
written on fault-tolerant
hardware platforms.
-
software-based TMR systems.
Our approach allows a Triple Modular Redundancy
(TMR) system to be built out of stock workstations connected by
a standard network.
Using custom TMR hardware systems has been a popular technique
for many mission
critical applications.
The proposed techniques enable these applications to run on
standard platforms with the same degree of fault tolerance and high
availability but at
a lower cost.
-
flexibility through multiple mechanisms.
The proposed software layer uses seven
techniques for adding high availability and fault tolerance to
distributed objects, in
contrast to existing systems that are built around a single technique
or abstraction.
Our
approach recognizes that different applications have different
availability
requirements, and allows the application developer to select the level
of availability
and associated performance penalty that are most appropriate for the
application.
-
on-line software installation.
Installing new versions of software to add functionality or repair bugs can lead to downtime.
The proposed software layer includes
mechanisms that install new software releases without
causing application downtime
or significant performance degradation (dynamic program updating).
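As an illustration of the software-based TMR item above, the following is a minimal sketch of majority voting over three replicas. It is not the project's implementation: the names `majority_vote` and `invoke_tmr` are hypothetical, replicas are modeled as plain Python callables rather than CORBA or DCE objects, and a crashed replica is modeled as one that raises an exception.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(results: Sequence[Hashable], replica_count: int) -> Hashable:
    """Return the value reported by a strict majority of all replicas.

    The quorum is computed from the total number of replicas, not from the
    number of replies, so two agreeing replies are still required when one
    of three replicas has failed.
    """
    quorum = replica_count // 2 + 1
    if results:
        value, count = Counter(results).most_common(1)[0]
        if count >= quorum:
            return value
    raise RuntimeError("no majority among replica results")


def invoke_tmr(replicas: Sequence[Callable[[int], Hashable]], request: int) -> Hashable:
    """Send the same request to every replica and vote on the replies.

    Hypothetical helper: a replica that raises is treated as crashed and
    simply contributes no vote.
    """
    results = []
    for replica in replicas:
        try:
            results.append(replica(request))
        except Exception:
            continue  # crashed replica: no vote
    return majority_vote(results, len(replicas))


if __name__ == "__main__":
    def healthy(x: int) -> int:
        return x * 2

    def faulty(x: int) -> int:
        return x * 2 + 1  # returns a wrong value

    # One faulty replica is outvoted by the two healthy ones.
    print(invoke_tmr([healthy, healthy, faulty], 21))
```

In this sketch a single wrong or missing reply is masked because the two remaining replicas still agree. If the two surviving replies disagree, the vote fails instead of guessing.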
The Name
Anatidae is the Latin name of the family of birds under which geese are
classified. The selection of the name was inspired
by the resilience of the Canada geese that occupy Bellcore's parking lot.
These geese are able to continue operation notwithstanding adverse working
conditions ranging from fatal car accidents to subfreezing temperatures
during the winter and the scorching sun during the summer. The currently
contrived acronym should read as "AN Availability Toolkit for Implementing
Distributed Applications Everywhere".