Three Year Review
Dan Siewiorek, Carnegie Mellon University
Zary Segall, University of Oregon
December 1995
Contracts:
"Fault Tolerant MACH" N00014-92-J-4139
"Ultra-Dependable Real-Time Computing" N00014-1-94-1-0210
1.0 Technical Objectives
1.1 Fault Tolerant MACH
The current trend in computing is towards open systems employing hardware
and software from multiple vendors tied together by portable software
packages. The UNIX operating system ushered in a new era of user freedom
from proprietary hardware/software platforms that commercial vendors used
to capture customers. UNIX provided a portable environment wherein a piece
of software developed on one system could be moved to another system with
minor effort. Since UNIX was available on a wide variety of platforms, the
user could purchase the most cost effective hardware without incurring an
enormous software redesign effort. MACH extends the UNIX portability beyond
the hardware platform by providing a uniform treatment of both networked
(often called distributed computing) and parallel processing (often called
shared memory) computational models. MACH sets a trend for contemporary
operating systems by employing a microkernel whereby the basic operating
system functions, such as allocate memory or start up a task, are implemented
in the microkernel. Traditional operating system services, such as a file
system, are implemented as servers executing on top of the microkernel.
As computing systems assume more and more critical tasks wherein an error
can have catastrophic consequences, attributes of computer systems other
than just cost or performance become more important. One such attribute is
the ability to tolerate a variety of errors ranging from physical defects
to environmentally induced changes to human errors. There have been fault
tolerant commercial computers for almost two decades. Most fault tolerant
systems have involved proprietary hardware and software, locking users into
a single vendor. Furthermore the user had to select either fault tolerance
or performance - an application could not decide to place some resources on
improving fault tolerance and the remaining resources on performance. No
trade-off between fault tolerance and performance was possible. While a
network of distributed computers running open system software has a natural
degree of redundancy so that physical hardware failures could be tolerated,
software to take advantage of this feature has been slow to develop.
Research has produced software which can tolerate network node failure
assume fail-fast network nodes, implying that faults are either detected
or recovered from before erroneous output can enter the network.
Current open systems such as MACH do not implement the fail-fast model.
The goal of this research was to design and implement a Fault Tolerant version
of the MACH operating system (FT MACH) that adhered to the fail-fast model
and allows the user to select the amount of fault tolerance (including none)
to be allocated to each application.
1.2 Ultra-Dependable Real-Time Computing
Over the past 20 years, benchmarks have evolved from simple, synthetic
programs to comprehensive application suites for measuring the
performance of computer systems, both for users of systems and for designers
of systems. The benchmarks have fostered a sense of competition
among manufacturers to produce faster systems. Today there are no benchmarks
to measure the robustness and dependability of computer systems.
Without benchmarks it is difficult to compare the robustness and dependability
of individual techniques or of complete systems. In addition, relative
progress cannot be measured. The objective of Robustness Benchmarks is to
define measures of robustness, develop methodologies for measuring robustness,
and to implement portable software that can be used to evaluate fault
tolerant systems.
2.0 Technical Approach
2.1 Fault Tolerant MACH
The initial focus was on adding error-detection mechanisms to various
features of MACH. The first step added observability and controllability
to services provided by the MACH run-time library. Library calls are made
to an application built upon the microkernel. The library server has been
modified so that all calls are encapsulated into a standard "envelop"
providing a "flight record" of time, calling parameters, and returns.
The envelop concept has been formalized as the sentry model. In this model,
MACH services are viewed as a combination of all possible execution paths
and data structures involved in serving a request. Sentries are placed at
the entry and exit points of services in order to perform fault management.
Hence, in the sentry model a MACH call can be guarded by more than one
pair of entry/exit sentries. Sentries have been categorized to reflect
their structure and functionality. Four types of sentries have been defined:
Fault Detection Sentries (FDS), Fault Recovery Sentries (FRS), Fault
Monitoring Sentries (FMS), and Validation/Fault Injection Sentries (VFS).
Fault Monitoring Sentries have been implemented for user level operating
system calls in MACH 3.0. These monitoring sentries report call entry and
exit time stamps as well as input/output parameters. The ability to trace
system behavior, particularly in the vicinity of an error, has been very
useful at identifying software "bugs" that appeared under stressing
workloads.
2.2 Ultra-Dependable Real-Time Computing
A methodology has been developed for the construction of user mode
modular robustness benchmarks. The system is stressed with incorrect
system calls representative of the type of errors made by application
designers or corrupted data. The modular benchmarks focus on single
errors to enhance repeatability and to isolate the corrupting input.
The benchmarks are executed on the actual target system is contrast
to fault injection which typically requires modifications to the system,
simulation which is an imperfect model of the system, and physical
methods such as heavy ion bombardment and pin-level injection which
exposes systems to random errors and possible damage.
The Robustness Benchmarks target specific functions of the operating
system (such as the memory allocator, the file system, the communication
subsystem, the runtime library, etc.) and define a class of feasible
faults (such as passing random characters that may have been generated
through communication line noise from remote computing sites) that
are deemed most likely to occur with respect to that operating system
feature. Each benchmark generates a series of test cases and keeps
track of the number of cases which are successfully detected. The
benchmark is robust enough to maintain accurate statistical count even
if one of the tests crashes the operating system.
3.0 Accomplishments
3.1 Fault Tolerant MACH
The first Fault Recovery Sentry for MACH 3.0 implemented journalling.
The Fault Monitoring Sentries are used to capture keyboard/mouse
inputs as well as operating system call input/output parameters from
application programs and to journal these parameters onto permanent stable
storage. For typical interactive workstation user sessions (as opposed to
compute-intensive workstation usage) journalling requires approximately 10
MBytes per hour of storage with CPU overheads ranging from a few percent to
unnoticeable for applications such as word processing, drawing packages, and
desktop publishing. Recovery after a crash is totally automatic and all
data, except perhaps for the last keystroke, is automatically recovered
through the replay of the journal. Journal replay time is a function of
the amount of user interaction and the amount of computationally-intensive
time. For interactive user-oriented sessions the replay time is typically
around ten percent of the original session. Journalling Fault Recovery
Sentries has been demonstrated with a wide variety of Unix-based applications
and do not require detailed knowledge of the application's internal structure.
The second Fault Recovery Sentry for MACH 3.0 implemented check-pointing
and rollback. A novel solution to capturing a checkpoint of multiple
concurrent tasks coupled with journalling reduce the amount stable storage
requirements to a total of 10 MBytes and recovery time to a few minutes with
check-pointing occurring as a background activity.
3.2 Ultra-Dependable Real-Time Computing
Based upon our previous study of Robustness Benchmarks the technology was
applied, under separate funding, to an Air Force Satellite computer ASCM
based upon the 1750A instruction set by the IBM Federal Systems Division at
Manassas, Virginia. Using a systematic, modular approach the parameters for
operating system calls were identified as well as likely error
manifestations. A watchdog program executed a series of operating system
calls with a variety of the illegal parameters. Almost 20,000 tests were
executed resulting in dozens of cases causing warm restarts of computer
modules and one case of a cold restart. Approximately one-fourth of the
operating system calls were thus tested in this ADA environment. The
Fault Monitoring Sentries have been used to observe MACH 3.0 system behavior
prior to a benchmark induced fault. If the fault results in a system crash,
we attempt to generalize the system state that induced the crash so
that a robustness benchmark can be designed to probe that single feature.
The Robustness Benchmark methodology has been applied to an aerospace
fault tolerant computer. It has work has been extended and ported to test
the Mach operating system. Data is currently being collected and a
paper will be written shortly. A kernel-level fault injection and test
environment is being implemented utilizing the Sentry mechanism. This
environment will be compared to the Mach Robustness Benchmark results.
4.0 Importance of the Accomplishments
The concept of Sentries has been defined, designed, implemented, and
demonstrated. Sentries are implemented as middleware between unmodified
application code and unmodified operating systems. Sentries intercept
operating system service requests from the application. Sentries can
provide services both on entry to the operating system and upon exiting back
to the application. Services provided by the sentries enhance the
observability and controllability of the system. Several classes of
services have been identified including: journalling for roll-back,
assertions for error detection, replication for fault tolerance, fault
injection for validation, etc.
Sentries represent a framework for producing highly-dependable systems
from commercial off-the-shelf hardware and software. Unmodified legacy
application software can be turned into fault-tolerant services. The sentry
mechanism has been demonstrated with journalling applied to the Mach
operating system for workstations and the Windows operating system for
personal computers. Journalling sentries allow complete recovery of even
multitasking legacy application software from errors induced by
hardware (e.g. power outage), software (undoing a system call which led
to a system crash), and operator mistakes (e.g. undoing the previous
command).
The Robustness Benchmark methodology has been effective at discovering
design flaws in error detection/handling mechanisms is both commercial
and dedicated aerospace fault-tolerant systems.
5.0 Transitions of Research
5.1 To Navy and DOD Organizations
5.2 To Industry
The commercial relevance of the Sentry technology, supported under these
ONR grants, has been recognized through the award by ARPA of a SBIR to
Systems Technology/Development Corporation (ST/DC) of Reston, Virginia.
Initial results during Phase I of the SBIR have demonstrated that
Sentry technology provides highly effective levels of fault tolerance,
with one to two orders of magnitude reduction in costs compared to
proprietary hardware/software solutions. Unlike most commercial fault
tolerant systems, which require users to modify and recompile their
application code, the application transparency features of Sentries
achieves fault tolerance capabilities without modifying application
code. A commercial product based on the sentry technology is planned
for the first quarter of 1996. Negotiations are underway with two
developers/distributors of PC based software for the Sentry based
commercial product. Symantec, Dynamics, and Martin Marietta have
expressed firm interest to help transition this technology into
their current products and applications.
6.0 Research Papers
Gupta, A. P., W. P. Birmingham, D. P. Siewiorek, "Automating the Design
of Computer Systems," IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, Vol. 12, No. 4, pp, 473-487, April 1993.
Siewiorek, Daniel P., Asim Smailagic, "A Case Study in Embedded-System
Design" The VuMan 2 Wearable Computer," IEEE Design and Test of Computers,
Vol. 10, No. 4, 1993.
Russinovich, M. Z. Segall, D. P. Siewiorek, "Application Transparent Fault
Management in Fault-Tolerant MACH", Proceedings 23 International Symposium
on Fault Tolerant Computing, Toulouse, France, pp. 10-19, June 1993.
Siewiorek, D. P., J.J. Hudak, B.-H. Suh, Z. Segall, "Development of a
Benchmark to Measure System Robustness", Proceedings 23 International
Symposium on Fault Tolerant Computing, Toulouse, France, pp. 88-97, June
1993.
Hudak, J., B. Suh, D. Siewiorek, Z. Segall, "Evaluation and Comparison of
Fault-Tolerant Software Techniques", IEEE Transactions on Reliability,
Vol. 42, No. 2, pp.. 190-204, June 1993.
Mukherjee, A., "Measuring Software Dependability by Robustness Benchmarking",
Technical Report CMU-CS-94-148, May 1994.
Russinovich, M., "Application-transparent fault management", PhD
Dissertation, Electrical and Computer Engineering, Carnegie Mellon
University, August 1994.
Russinovich, M. and Z. Segall, "Application-Transparent Check-pointing
in Mach 3.0/UX", 27th Hawaii Int. Con. System. Sciences, Jan. 1995.
Dingman, C. P., J. Marshall, D. P. Siewiorek, "Measuring Robustness of a
Fault Tolerant Aerospace System," Proceedings 25 International Symposium
on Fault Tolerant Computing, Los Angles, CA, pp. 522-527, June 1995
Russinovich, M., Z. Segall, "Fault-Tolerance for Off-The-Shelf Applications
and Hardware", Proceedings 25 International Symposium on Fault Tolerant
Computing, Los Angles, CA. pp 67-71, June 1995.
Siewiorek, D., "Niche Successes to Ubiquitous Invisibility: Fault-
Tolerant Computing Past, Present, and Future", Special Silver Jubilee
Proceedings, International Symposium on Fault Tolerant Computing,
Los Angles, CA., June 1995.
Christopher Dingman, Joseph Marshall, Daniel Siewiorek, "Measuring
Robustness from the System Call Level", 4th IEEE International Workshop
on Evaluation Techniques for Dependable Systems, San Antonio, TX, October 1995