Recovery Management in QuickSilver
Roger Haskin, Yoni Malachi, Wayne Sawdon, and Gregory Chan

Executive summary:

This paper presents a transaction-based recovery mechanism for a distributed system. The central idea is that atomic transactions can serve as a general recovery mechanism for many different kinds of servers and clients. This is achieved by exposing the commit protocol and the logging primitives and letting clients and servers use them to implement their own recovery techniques.

A Transaction-Based Recovery Manager

The recovery manager is implemented as a server process and consists of two parts:

  1. Transaction manager: Manages commit coordination by communicating with servers at its own node and with transaction managers at other nodes.
  2. Log manager: Serves as a common recovery log, holding both the Transaction Manager's commit records and the servers' recovery data.

The basic idea behind recovery management in QuickSilver is as follows: clients and servers interact using IPC messages. Every IPC message belongs to a uniquely identified transaction, and is tagged with its transaction ID (Tid). Servers tag the state they maintain on behalf of a transaction with its Tid. IPC keeps track of all servers receiving messages belonging to a transaction, so that the Transaction Manager (TM) can include them in the commit protocol. TM's commit protocol is driven by calls from the client and servers, and by failure notifications from the kernel. Servers use commit protocol messages both as a signaling mechanism that informs them of failures and as a synchronization mechanism for achieving atomicity. Recoverable servers call the Log Manager (LM) to store their recovery data.
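This mechanism can be sketched in a few lines of Go. It is an illustrative model only (QuickSilver itself is not written in Go): the Request and send names and the participants map are assumptions, but the structure mirrors the description above: every IPC message carries a Tid, and the IPC layer records each receiving server so the TM can later drive it through the commit protocol.

    // Illustrative sketch: Tid-tagged IPC with participant tracking.
    package main

    import "fmt"

    // Tid identifies a transaction; in QuickSilver it is globally unique.
    type Tid uint64

    // Request is an IPC message tagged with the transaction it belongs to.
    type Request struct {
        Tid     Tid
        Server  string // destination server
        Payload string
    }

    // participants maps each Tid to the set of servers that have received a
    // request on its behalf; the TM consults this during commit processing.
    var participants = map[Tid]map[string]bool{}

    // send delivers a request, recording the receiving server as a participant.
    func send(r Request) {
        if participants[r.Tid] == nil {
            participants[r.Tid] = map[string]bool{}
        }
        participants[r.Tid][r.Server] = true
        fmt.Printf("delivered %q to %s under Tid %d\n", r.Payload, r.Server, r.Tid)
    }

    func main() {
        send(Request{Tid: 1, Server: "fileServer", Payload: "write block"})
        send(Request{Tid: 1, Server: "nameServer", Payload: "bind name"})
        fmt.Println("participants of Tid 1:", participants[1])
    }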

Transaction Management

A transaction is identified by a globally unique Transaction Identifier (Tid). Every IPC request belongs to a transaction and is tagged with its Tid. Servers tag all resources they maintain on behalf of a transaction with its Tid.

A process calls Begin to start a new transaction. The local TM creates the transaction, gives it a Tid, and becomes its coordinator. The process that called Begin is the transaction's owner. That process can now issue requests on behalf of the transaction and call Commit or Abort.
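A minimal sketch of the owner's view of this interface follows. The names Begin, Commit, and Abort come from the paper; the Go signatures and the TM behavior shown are assumptions for illustration.

    // Illustrative sketch of the transaction owner's API.
    package main

    import "fmt"

    type Tid uint64

    var nextTid Tid

    // Begin asks the local TM to create a transaction; the caller becomes
    // its owner and the local TM its coordinator.
    func Begin() Tid {
        nextTid++
        fmt.Printf("TM: created transaction %d, local node is coordinator\n", nextTid)
        return nextTid
    }

    // Commit and Abort terminate a transaction the caller owns
    // (any participant may also call Abort).
    func Commit(t Tid) { fmt.Printf("TM: commit requested for %d\n", t) }
    func Abort(t Tid)  { fmt.Printf("TM: abort requested for %d\n", t) }

    func main() {
        t := Begin()
        // ...issue Tid-tagged requests on behalf of t...
        Commit(t)
    }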

When a server receives an IPC request on behalf of a transaction it hasn't seen before, it becomes a participant in that transaction. Participants take part in the commit protocol and can issue requests on behalf of the transaction. Participation ends when the server responds to a vote request.

A transaction becomes distributed when a request tagged with its Tid is issued to a remote node. When a process at node A issues a request to a server at node B, the TM at B is registered as a subordinate of the TM at node A. Thus, a transaction builds a directed graph of nodes. Each node only knows which nodes are its subordinates - there is no global knowledge of the transaction topology.
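The following sketch models how this graph might be tracked. The remoteRequest function and the map layout are assumptions, but they capture the key property that each node records only its own subordinates and no node sees the whole topology.

    // Illustrative sketch: building a transaction's node graph.
    package main

    import "fmt"

    type Tid uint64

    // subordinates[node][tid] is the set of nodes that node's TM must drive
    // through the commit protocol for tid.
    var subordinates = map[string]map[Tid]map[string]bool{}

    // remoteRequest records node "to" as a subordinate of node "from" for tid.
    func remoteRequest(tid Tid, from, to string) {
        if subordinates[from] == nil {
            subordinates[from] = map[Tid]map[string]bool{}
        }
        if subordinates[from][tid] == nil {
            subordinates[from][tid] = map[string]bool{}
        }
        subordinates[from][tid][to] = true
        fmt.Printf("TM at %s: %s registered as subordinate for Tid %d\n", from, to, tid)
    }

    func main() {
        remoteRequest(7, "A", "B") // client at A calls a server at B
        remoteRequest(7, "B", "C") // that server in turn calls a server at C
        // A knows only about B; B knows only about C.
        fmt.Println("A's subordinates:", subordinates["A"][7])
        fmt.Println("B's subordinates:", subordinates["B"][7])
    }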

Transaction Termination and Failure

A TM terminates a transaction and initiates commit processing when the owner calls Commit, when any participant calls Abort, or when a failure occurs (for example, the failure of another node or of a connection).

A transaction can fail before it terminates. The failed transaction is allowed to continue, but the failure is remembered and the transaction is aborted when it does terminate. A TM causes a transaction to fail if a volatile or recoverable participant fails, if a connection to a subordinate fails, or if a subordinate TM reports the failure of the transaction.

Transaction Checkpoints

A call to Checkpoint saves the partial results of a transaction. Changes made up to the checkpoint take effect permanently; if the transaction is aborted, it rolls back to the checkpoint and no further. This is useful for servers with large data sections, as it allows them to recover from a checkpoint without having to write all their data to disk.
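A toy example of these semantics, using an assumed in-memory account rather than a real recoverable server:

    // Illustrative sketch: abort rolls back only to the last checkpoint.
    package main

    import "fmt"

    type account struct {
        balance    int
        checkpoint int // balance as of the last Checkpoint call
    }

    // Checkpoint makes the partial results so far permanent with respect
    // to a later abort.
    func (a *account) Checkpoint() { a.checkpoint = a.balance }

    // Abort rolls back to the last checkpoint and no further.
    func (a *account) Abort() { a.balance = a.checkpoint }

    func main() {
        a := &account{balance: 100, checkpoint: 100}
        a.balance += 50
        a.Checkpoint() // 150 is now permanent
        a.balance -= 80
        a.Abort()              // undoes only the -80
        fmt.Println(a.balance) // 150
    }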

Commit Processing

Commit processing in QuickSilver is similar to the standard two-phase commit protocol, except that vote requests are propagated through the transaction's directed graph instead of having the coordinator communicate with all of the participants directly.

A TM initiates commit processing in response to a transaction termination condition. To abort, it sends abort requests to all of its subordinates and participants. To commit, it sends vote requests to all of its subordinates and participants. When a recipient of a vote request is prepared (recoverably ready to commit or abort), it returns its vote (commit or abort). To become prepared, a TM must receive vote-commit responses to all of its vote requests. When the coordinator is prepared, it commits the transaction by writing a commit record to the log and sending out end requests, which carry the outcome of the transaction (end-commit or end-abort). The transaction is complete when all end requests have been acknowledged.
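The sketch below models this propagation over a small transaction graph. Logging, failure notification, and acknowledgments are elided, and the tm type is an assumption, but the recursive vote and end phases follow the description above.

    // Illustrative sketch: hierarchical two-phase commit over a node graph.
    package main

    import "fmt"

    type tm struct {
        node         string
        votesCommit  bool // combined vote of this node's local participants
        subordinates []*tm
    }

    // vote returns true only if this TM and every subordinate are prepared.
    func (t *tm) vote() bool {
        prepared := t.votesCommit
        for _, s := range t.subordinates {
            if !s.vote() {
                prepared = false
            }
        }
        fmt.Printf("%s votes commit=%v\n", t.node, prepared)
        return prepared
    }

    // end propagates the outcome (end-commit or end-abort) down the graph.
    func (t *tm) end(commit bool) {
        fmt.Printf("%s: end-commit=%v\n", t.node, commit)
        for _, s := range t.subordinates {
            s.end(commit)
        }
    }

    func main() {
        c := &tm{node: "A (coordinator)", votesCommit: true,
            subordinates: []*tm{
                {node: "B", votesCommit: true},
                {node: "C", votesCommit: true,
                    subordinates: []*tm{{node: "D", votesCommit: true}}},
            }}
        if c.vote() {
            // The coordinator would write a commit record to the log here.
            c.end(true)
        } else {
            c.end(false)
        }
    }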

A server can choose to take part in the full two-phase commit or merely to be notified at certain points during commit processing (see the sketch below):

  1. One-phase immediate: notified during the vote phase of the two-phase protocol.
  2. One-phase standard: notified when the transaction commits or aborts.
  3. One-phase delayed: notified when commit processing has ended.
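One way to picture these choices is as an enumeration of notification points. The constant names below mirror the terms above; they are not actual QuickSilver identifiers.

    // Illustrative sketch: commit-protocol participation modes.
    package main

    import "fmt"

    type participation int

    const (
        twoPhase          participation = iota // full vote + end protocol
        onePhaseImmediate                      // notified during the vote phase
        onePhaseStandard                       // notified at commit or abort
        onePhaseDelayed                        // notified after commit processing ends
    )

    // notifyAt reports when a server with the given mode hears from the TM.
    func notifyAt(p participation) string {
        switch p {
        case twoPhase:
            return "vote phase and end phase"
        case onePhaseImmediate:
            return "vote phase only"
        case onePhaseStandard:
            return "commit/abort"
        default:
            return "after commit processing ends"
        }
    }

    func main() {
        fmt.Println(notifyAt(onePhaseStandard)) // commit/abort
    }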

Coordinator Reliability

Coordinator Migration: If a coordinator has a single subordinate, it can designate the subordinate as the new coordinator. This tends to put the coordinator where the recoverable resources are, reducing the probability of a functioning server having to wait for a failed coordinator.

Coordinator Replication: If the coordinator has several subordinates, it can designate a subset of them as backup coordinators. A hierarchical commit protocol is used between the remaining subordinates and the coordinators, and a special protocol is used among the coordinators themselves.

Log Manager

The log manager (LM) provides a low-level interface to an append-only log. Log blocks are buffered until the buffer is full or a force is requested; blocks are then written to the online log. The online log is a circular list of blocks in which the oldest block is always the one replaced. If a block about to be replaced still holds live data (data some server still needs), it can be written to the archive, where it remains accessible, though perhaps at a performance penalty.

Each log client, the TM included, has a log tail that points to the oldest record it will need for crash recovery. When newly written records approach the log tail, LM gives the server the option of taking a log checkpoint, doing whatever is necessary to move its log tail forward. Of course, a server can always skip this and rely on the archive.
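A compact model of the online log and archive described above; the block size, the liveness test, and the archive representation are all assumptions.

    // Illustrative sketch: circular online log that spills live blocks
    // to an archive before overwriting them.
    package main

    import "fmt"

    type block struct {
        seq  int  // monotonically increasing block number
        live bool // true if some server still needs this block for recovery
    }

    type log struct {
        online  []block // fixed-size circular online log
        next    int     // index of the slot to overwrite next
        archive []block // blocks demoted from the online log
    }

    // append writes a block, archiving the displaced block if it is still live.
    func (l *log) append(b block) {
        old := l.online[l.next]
        if old.live {
            l.archive = append(l.archive, old) // readable later, at a cost
        }
        l.online[l.next] = b
        l.next = (l.next + 1) % len(l.online)
    }

    func main() {
        l := &log{online: make([]block, 3)}
        for i := 1; i <= 5; i++ {
            l.append(block{seq: i, live: i == 1}) // pretend block 1 stays live
        }
        fmt.Println("archived:", l.archive) // block 1 was spilled to the archive
    }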

Recovery

When a node crashes and is rebooted, LM scans the log starting at the TM's log tail and determines the status of each transaction known to the TM at the time of the crash. Pointers to the oldest and newest records of each transaction are stored, allowing servers to read only the blocks that contain their data. Finally, LM starts accepting identify requests, and servers can do whatever they like with the log information.
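The restart scan can be sketched as a single pass that computes the oldest and newest record positions per Tid; the record layout here is an assumption.

    // Illustrative sketch: one-pass restart scan building per-transaction
    // oldest/newest record pointers.
    package main

    import "fmt"

    type record struct {
        pos int    // position in the log
        tid uint64 // transaction the record belongs to
    }

    type span struct{ oldest, newest int }

    // scan builds the per-transaction oldest/newest pointers in one pass.
    func scan(recs []record) map[uint64]span {
        spans := map[uint64]span{}
        for _, r := range recs {
            s, seen := spans[r.tid]
            if !seen {
                s = span{oldest: r.pos, newest: r.pos}
            } else {
                s.newest = r.pos
            }
            spans[r.tid] = s
        }
        return spans
    }

    func main() {
        log := []record{{10, 1}, {11, 2}, {12, 1}, {13, 2}, {14, 1}}
        fmt.Println(scan(log)) // map[1:{10 14} 2:{11 13}]
    }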

Conclusions

The authors conclude that the recovery management overhead is negligible, even for servers with stringent performance requirements. Further, the recovery manager suits the needs of a wide variety of servers.