[yapc 99 talks]

Braam, Callahan, and Schwan

Peter J. Braam, SCS Carnegie Mellon University
Michael J. Callahan, Ask Jeeves, Inc.
Philip I. Schwan, SCS Carnegie Mellon University

The InterMezzo Distributed File System

45 minute talk

Our goal was to see whether we could build file systems with features like those of Coda using far less code. We chose to use existing disk file systems as a cache, supplemented by a filter driver (also known as a stackable file system) that communicates with user-level cache managers and file servers to manage the currency and caching of data.

The user-level file server and cache manager will be the central topic of this paper: they were written in Perl.

Cache managers and file servers are particularly complex servers. A cache manager, for example, is a server to the kernel, supplying it with cached data, and a client of the file servers, fetching data not yet cached. Advanced features such as write-back caching and reintegration make file servers both servers to and clients of cache managers. In short, the distinction between clients and servers is blurring, and advanced RPC mechanisms are needed.

While this request processing is in progress, a great deal of state is shared and must be carefully protected. Traditionally, threads have been used, and this leads to complicated locking schemes. Asynchronous request processing has proven to be a very attractive alternative in settings such as the ACE toolkit, Erlang, and Windows NT and VMS kernel code. For InterMezzo, our servers and cache managers, collectively named "lento", use the latter approach.
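The appeal of the asynchronous approach can be illustrated with a minimal sketch. It is written in Python with asyncio purely for illustration; lento itself is Perl, and the names `cache` and `handle_request` are hypothetical. In a single-threaded event loop, control passes between handlers only at explicit yield points, so a read-modify-write on shared state between two such points needs no lock.

```python
import asyncio

# Shared state touched by every request handler (hypothetical example data).
cache = {"hits": 0}


async def handle_request(count):
    """Process `count` requests, yielding to other handlers between them."""
    for _ in range(count):
        await asyncio.sleep(0)  # explicit yield point: other handlers may run here
        cache["hits"] += 1      # no yield inside this update, so no lock is needed


async def main():
    # Ten handlers run interleaved on one thread.
    await asyncio.gather(*(handle_request(1000) for _ in range(10)))


asyncio.run(main())
```

With threads, the same counter update would need a mutex; here every increment is applied because no handler is ever interrupted in the middle of the update.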

Lento is an event-driven finite state machine built on the POE package and various extensions. Inside lento many sessions exist. Sessions react to events: our events are triggered by the arrival of data and by timers. When data has been delivered, it is either handled by an existing session or it instantiates a new session. For example, if the kernel asks to refresh a directory, a session is set up for this, unless the directory is already involved in a session, in which case an event is posted to that session.
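The dispatch rule described above (reuse the session already attached to a directory, otherwise create one) can be sketched as follows. This is a Python illustration under stated assumptions, not lento's Perl code and not the POE API: the classes `Session` and `Dispatcher`, the event names, and the paths are all hypothetical.

```python
from collections import deque


class Session:
    """Hypothetical session: a small state machine tied to one directory."""

    def __init__(self, path):
        self.path = path
        self.state = "fetching"
        self.handled = []  # events this session has processed, in order

    def handle(self, event):
        self.handled.append(event)
        if event == "data_arrived":
            self.state = "done"


class Dispatcher:
    """Routes each event to an existing session or instantiates a new one."""

    def __init__(self):
        self.sessions = {}    # path -> active Session
        self.queue = deque()  # pending (path, event) pairs
        self.created = 0      # number of sessions instantiated so far

    def post(self, path, event):
        self.queue.append((path, event))

    def run(self):
        while self.queue:
            path, event = self.queue.popleft()
            session = self.sessions.get(path)
            if session is None:
                # No session involves this directory yet: set one up.
                session = Session(path)
                self.sessions[path] = session
                self.created += 1
            session.handle(event)
            if session.state == "done":
                del self.sessions[path]


dispatcher = Dispatcher()
dispatcher.post("/vol/src", "refresh")
dispatcher.post("/vol/src", "refresh")  # same directory: goes to the existing session
dispatcher.post("/vol/doc", "refresh")
dispatcher.run()
```

After this run only two sessions have been created, and the session for `/vol/src` has received both refresh events.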

The data arrives over the network and from the kernel and is delivered by so-called Wheels, which are sessions handling asynchronous I/O. Wheels have Drivers for reading and writing and are supplemented by filters that assemble high-level structures out of raw binary data. Some filters come with POE, such as a FreezeThaw filter. Others were constructed by us to deal with kernel-user communication. In the near future we will also need Wheels to do asynchronous file I/O, possibly involving transactions for tied databases. Other Wheels accept incoming TCP connections and set up client sessions for these. The Wheels form non-blocking portals to the core of the cache manager and file server.
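As an illustration of what a filter does, the sketch below assembles complete records out of arbitrarily split chunks of raw bytes. It is written in Python and assumes a made-up wire format (a 4-byte big-endian length followed by the payload); the formats actually used by POE's filters or by lento's kernel-user channel are not described here, and the class name `LengthPrefixFilter` is hypothetical.

```python
import struct


class LengthPrefixFilter:
    """Assembles length-prefixed records from raw byte chunks (assumed format)."""

    HEADER = struct.Struct("!I")  # 4-byte unsigned length, network byte order

    def __init__(self):
        self.buffer = b""

    def feed(self, chunk):
        """Add raw bytes; return the list of records completed by this chunk."""
        self.buffer += chunk
        records = []
        while len(self.buffer) >= self.HEADER.size:
            (length,) = self.HEADER.unpack(self.buffer[:self.HEADER.size])
            end = self.HEADER.size + length
            if len(self.buffer) < end:
                break  # record incomplete: wait for more data
            records.append(self.buffer[self.HEADER.size:end])
            self.buffer = self.buffer[end:]
        return records

    @classmethod
    def encode(cls, payload):
        """Frame one payload for writing."""
        return cls.HEADER.pack(len(payload)) + payload


record_filter = LengthPrefixFilter()
wire = LengthPrefixFilter.encode(b"lookup") + LengthPrefixFilter.encode(b"getattr")
first = record_filter.feed(wire[:3])     # partial header only
second = record_filter.feed(wire[3:12])  # completes the first record
third = record_filter.feed(wire[12:])    # completes the second record
```

Because the filter keeps partial data in its buffer, the reader never blocks waiting for the rest of a record; it simply returns whatever is complete so far.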

The core sessions encode the semantics of file sharing. For example, the file server session may get a write lock request on a volume and will notify the readers that the data is about to be modified, before granting the write lock. Note that this in itself involves sending out multiple requests, collating the answers and dealing with readers that have timed out - something well beyond elementary RPC processing.
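The write lock example can be sketched as a fan-out of notifications followed by collation of the replies under a deadline. This is a Python asyncio illustration, not lento's implementation: `notify_reader`, `grant_write_lock`, the reader names, and the simulated delays are all hypothetical.

```python
import asyncio


async def notify_reader(name, delay):
    """Hypothetical reader: acknowledges the notification after `delay` seconds."""
    await asyncio.sleep(delay)
    return name


async def grant_write_lock(readers, timeout):
    """Notify every reader, collate acknowledgements, and report stragglers.

    `readers` maps a reader name to its simulated response delay.
    Returns (acknowledged, timed_out) as sorted lists of names.
    """
    if not readers:
        return [], []
    tasks = {
        asyncio.create_task(notify_reader(name, delay)): name
        for name, delay in readers.items()
    }
    # Wait for all replies, but no longer than `timeout` seconds in total.
    done, pending = await asyncio.wait(list(tasks), timeout=timeout)
    for task in pending:
        task.cancel()  # give up on readers that did not answer in time
    acknowledged = sorted(task.result() for task in done)
    timed_out = sorted(tasks[task] for task in pending)
    return acknowledged, timed_out


acknowledged, timed_out = asyncio.run(
    grant_write_lock({"client-a": 0.01, "client-b": 0.02, "client-c": 5.0}, timeout=0.5)
)
```

Here `client-c` does not answer within the deadline and ends up in `timed_out`; what the server then does with such readers (for example, treating their cached data as stale) is a policy decision outside this sketch.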

InterMezzo has somewhat high latency because clients are given great autonomy. Once this latency has been paid, read performance is that of local disk file systems, and tree modifications might be up to 5 times faster than NFS.

We hope to have a first release ready before the Perl Conference, and our estimate is that the total amount of code (including kernel code for Linux and Windows NT, file servers and cache managers) will be around 12,000 lines, 8,000 of which are Perl.

Peter J. Braam is a Senior Systems Scientist at CMU where he is leading the Coda Project. Peter's interests are in distributed file and storage systems and he has worked on a variety of projects in this field, almost always involving Linux as the primary platform. Peter received his PhD from Oxford University in 1987 and subsequently held faculty positions at the University of Utah and at Oxford, before joining CMU in 1996. Peter is also President of Stelias Computing Inc., which specializes in consultancy on Linux file and storage systems. When not at work, he can perhaps be found in his Alaska cabin, or on a nearby mountain.

Michael Callahan is currently Director of Advanced Development at Ask Jeeves, Inc. Previously he has worked on networking and distributed filesystems, scientific visualization, and digital micropayments. He was trained as a mathematician at Harvard and Oxford Universities.

Phil Schwan is an undergraduate studying computer science and mathematics at CMU, intending to graduate in 2002. Past projects include a highly Linux-optimised FTP server, LCD panel driver work, and as-yet-unreleased GNOME dabblings. He looks forward to spending more time in the great white north as a member of the Puffin Group.

Contact:
Peter J. Braam
School of Computer Science, CMU
5000 Forbes Ave
Pittsburgh, PA 15213
Day: 412.268.5295
Fax: 412.268.5576


Kevin Lenzo
Last modified: Fri May 7 15:17:22 EDT 1999