Project 4: Writing a File System

Table of Contents

Project Overview

By now you should have a thorough understanding of how to implement a simple kernel on the x86 architecture. Features like multiple disjoint address spaces, context switching, and system call implementation should no longer be a mystery. In addition, you have now been exposed to a somewhat large scale systems programming project (both in size and complexity) if you had not been exposed to one before. Indeed, the kernel project is a valuable assignment.

The kernel you created in the previous project has one fatal flaw however - there is no way for users to save data from one period of system uptime to the next. Worse yet, there is not even a way for a program to save data such that it can be restored later by another incarnation of the same program (for example, a simple text editor opening a file created an hour earlier by a user).

This type of data is called persistent data, and one way to achieve it is with a file system. Using various forms of media we can save data so that it can be recovered later, even if the system goes down for some period of time or the accesses are from different processes with disjoint address spaces.

While the file system you will implement in this project is fairly simple, it will increase the utility of your operating system greatly. Users will be able to create files and store data in them. These files will then be available to other processes, even after the system has been rebooted.

Like the kernel project you have just completed, the file system is a four week project. It will again be very important to adhere to a reasonable schedule throughout the four weeks to ensure completion of the project. The file system probably involves the most code out of the four projects in this course. This causes problems not only in writing the code, but in finding bugs as well, so be prepared and make sure you leave time at the end for plenty of debugging.

Goals

Technology Disclaimer

Because of the availability, low cost, and widespread use of the x86 architecture, it was chosen as the platform for this sequence of projects. As its creator, Intel Corporation has provided much of the documentation used in the development of these projects. In its literature Intel uses and defines terms like interrupt, fault, etc. On top of this the x86 architecture will be the only platform used in these projects.

The goal of this project set is certainly not to teach the idiosyncrasies of the x86 architecture (or Intel's documentation). That said, it will be necessary to become accustomed to the x86 way of doing things, and the Intel nomenclature, for the purposes of completing this project set. Just keep in mind that the x86 way of doing things is not the only way of doing things. It is the price to be paid for learning the principles of operating systems on a real world system instead of a simulated architecture.

Important Dates

Monday, March 31st: Project four assigned.
Monday, April 7thth: Checkpoint one.
Monday, April 14th: Checkpoint two.
Monday, April 28th: Project four is due at midnight (Monday night).

Groups

The file system project is a group assignment. You should already be in a group of two from the previous project. If you are not in a group, or you are having other group difficulties, send email to staff-412@cs.cmu.edu.

Grading

The primary criteria for grading the file system will be correctness and robustness. Does the kernel crash if you pass a null pointer to your write system call? File systems are expected to take abuse and continue to run.

Although it is not the first thing we will look at, style is also important. Use an appropriate level of commenting, name variables meaningfully and consistently, and try not to resort to global variables when possible. Style is an especially important consideration when dealing with large amounts of code.

It is also our intent that part of the project 4 grade will be based on a face to face interview process. We reserve the right to give partners in a group different project grades if circumstances make that necessary.

Please hand in a design document with your file system. It should talk about all the important design decisions you made, what works, and what does not. If you tell us about something that does not work, it puts you in a better position to be given more partial credit than if you do not tell us and we stumble across it ourselves, because it lets us know that you understand that there is a problem and have spent some nonzero amount of time trying to find it.

Handin

The procedure for handing in project 4 will be the same as it was for project 3. Look for more specific handin directions closer to the project due date.

Hardware Primitives

Projects one and three involved quite a bit of hardware interaction. To build a file system however, we require only a couple basic abilities - the ability to read and write blocks to persistent storage. In this project, as is the case in many environments, we will be using a hard disk for persistent storage.

The Organization of a Hard Disk

A hard disk is composed of one or more disk-shaped pieces of magnetic media. Unlike floppy disks (which contain magnetic media that is, well, floppy), the disks in a hard disk are rigid, hence the name hard disk. Each disk in a hard disk is referred to as a platter. Information can be stored on either side of this platter, and each side of a platter is referred to as a surface.

Each surface is further divided by concentric circles organizing it into tracks. The last step is to divide each track into some number of sectors. The sectors within a track occupy an equal amount of surface area. These sectors are the unit of transfer for a disk; you may not read or write less than one sector. For most modern hard disks, a sector contains 512 bytes.

Modern hard disks get more complicated than this - for example, they compensate for the fact that with this simple scheme, tracks near the outside of the surface would have much larger sectors than the tracks near the inside. However we will not go into such details here.

In fact, through a convenient abstraction known as logical block addressing, or LBA addressing (yes, that is technically logical block addressing addressing, but that is unfortunately the way the term is used), the geometry of the disk need not be a concern to you. LBA addressing numbers the sectors throughout the disk (all sectors on all surfaces) from zero on up, so that the disk appears to the system programmer like a large array of sectors. This simplification does not come completely for free (you should think about why), but it does make your job easier for the purpose of designing and implementing your file system.

By convention, the first sector of the disk (sector zero) is not available for use by the file system. This sector is the boot sector. It contains important information such as a bootstrap program or partition table information. For this reason overwriting the first sector of the disk normally has catastrophic effects. You must ensure that your file system never writes to sector zero.

Blocks vs. Sectors

A block is different than a sector in that it is the minimum unit of transfer for the file system. You may choose to make blocks equal to one sector, or you may choose to have blocks larger than one sector. A block must be an integer multiple of sectors, however.

Using the Disk

Like the devices you supported in projects one and three, communication with the disk is done through I/O ports using functions like inb and outb. Configuring and using the disk is a little more complicated than reading scancodes from the keyboard. The disk must be properly configured first, and there are several steps needed to initiate a read or a write. For this reason we have supplied you with a very simple disk driver. It is available in the library for use with this project, libproj4.a. We have provided a version of the source in the tarball, however this is only for you to look at - please do not try to compile the disk driver we provide in the tarball..

It is important to remember that only one request may be sent to the disk at any given time. If you try to initiate an IDE transfer while another transfer is already pending, an error will occur and the device driver will cause your kernel to panic.

The disk driver has the following interface (defined in disk.h):

int disk_init()
Installs the interrupt handler for disk interrupts into the IDT, blocks until the disks are ready (on power up a real disk may need time to begin responding to commands), and finds out how big the disks are. SUCCESS and ERROR are returned as appropriate. You will need to call this function, directly or indirectly, during the setup phase of your kernel.

void ide_read_sector(int lba_addr, char *dest, int num)
Initiates an IDE operation to read the specified sectors from the disk. The sectors will be stored sequentially starting at the address dest.

void ide_write_sector(int lba_addr, char *source, int num)
Initiates an IDE operation to write the specified sectors to the disk. The sectors will be taken sequentially starting at the address source.

int get_num_sectors()
Returns the number of sectors available on the disk. You must call this after calling disk_init() (and having disk_init() return SUCCESS).

In order to use the disk driver, you must provide two callback functions, disk_read_callback() and disk_write_callback(). These two functions will be called when a read or write completes. You will most likely need to put some process management code in these callback functions.

File System Design

File systems come in a large variety of shapes and sizes. The file system you need to implement is not very complicated, but it does have some important features that you should take a moment to consider.

Files

While a disk contains sectors of a fixed size, user-level programs only see the contents of the disk through the file abstraction. A file is, as far as programs know, a sequence of contiguous bytes, even though the bytes that belong in a file may not be physically contiguous on disk. A file has a name by which it is created and found. Your file system needs to allow a variety of operations on files. Some identify the file by its name, while others use a file descriptor. A file descriptor contains information relevant to active (opened) files, including, for instance, the current offset into the file and the location of the file's metadata cached in memory. A file descriptor is created from a file name as a result of the open or create system calls.

Note that you need to manage file descriptors carefully when a process fork()s or exec()s. You should use UNIX semantics when managing file descriptors in these situations. This implies that the child process gets its own copy of the parent's file descriptors. These new descriptors reference the same underlying objects, so for example a seek on the file descriptor in the child will change the offset of the file descriptor in the parent. File descriptors are not destroyed when a process exec()s.

Naming

The hierarchical namespace of your file system will be similar to what you are familiar with from UNIX environments. A collection of variable-sized storage units (files) are referred to by name. File names can be grouped into directories. Directories can also include other directories, or a mix of files and directories. This creates a hierarchy of directories. Directories can contain any number of files and directories, without limits other than those imposed by the physical size of the disk. The directory tree starts at the root directory, denoted by "/". The character "/" also serves as a file separator. As in UNIX environments, the name ".." refers to the parent of the directory that contains it, and "." refers to the containing directory itself. The name "../foo" then would point to the file "foo" in the parent directory.

Hierarchical names, known as paths, may be absolute or relative. Absolute paths start with "/". Relative paths are transformed into absolute paths by your file system. Your file system needs to maintain a notion of the current directory of a process. This current directory is modified with the chdir system call. The current directory, which is a path itself, is prepended to relative paths to produce the corresponding absolute path.

A link creates a file name reference that refers to an existing file under an additional name. Your file system should support both early-binding (hard) and late-binding (symbolic) links. This allows the naming tree to be made into a directed graph.

As in UNIX environments, directories are stored as files whose contents are interpreted by the file system.

Fragmentation

Your files will contain an integral number of sectors. Even though the space in a sector may be only partially used, the entire sector is reserved for that file. This means that the file system may suffer from internal fragmentation. However, the file data need not be contiguous on disk. Adjacent disk sectors may contain data belonging to two different files, and a file's sectors need not be adjacent to each other. That means that the file system need not and must not be vulnerable to external fragmentation.

Persistent Metadata

Your file system exists to store application data in files. To manage the location and grouping of raw data into files, your file system will need to build data structures. These data structures associate directories with file names, and file names with collections of disk sectors. These data structures are collectively referred to as metadata. This metadata must be written in the same disk that holds the file data. You must decide which disk sectors are used to store data and which ones will be used to store metadata.

Metadata on disk is inconsistent when one part of the metadata indicates one fact and another part of the metadata contradicts that fact. For example, an inconsistent metadata set might indicate that a file exists, and at the same time indicate that the file does not appear in any directories. The metadata on disk is also inconsistent when some part of the metadata asserts a fact that is not reflected by the data present in the disk. For example, the metadata might indicate that a file is 4 kilobytes long but only the first 2 kilobytes of the file might have been written to disk.

Your file system need not constantly maintain consistency in the metadata stored on disk. You may assume that when your file system starts the metadata on disk is consistent. You must ensure, however, that immediately after an application calls the sync() system call the metadata on disk is consistent. At all other times during execution, the metadata written on disk need not be consistent.In the real world, if a system crash occurs when the disk metadata is inconsistent, a utility will make it consistent before the file system is restarted.

Limits

Note: A single file may be opened multiple times, each with its own instantiation of the relevant state of open files. For example, a file may be accessed with two different current file offsets through two different instances of open. It is also possible that two different file descriptors may refer to a single instantiation of an open file. In this case, the two file descriptors would share and update the current file offset concurrently.

Performance

Your file system need not guarantee that the lookup of a file inside a directory is fast. In particular, your file system may take longer to locate a file inside a large directory than in a small one. You need not use sorted and/or balanced data structures to implement directories, but you may do so.

Your file system should perform well, however, for random data access within a file. Accessing a file's data has two components: first, looking up the metadata to determine which disk sector B contains the data at file offset off, and second, accessing the disk sector B. Your file system should guarantee that lookup of B given off takes no longer than O(log off). This means, for example, that you cannot use a linked list to store the sector numbers that form a file. Also, the requirement that any disk be able to contain up to I directories and/or files, any of which may be the size of the disk, implies that you cannot simply use a directly indexed array for this purpose. Think about why this is true.

Buffer Cache

Because the disk is very slow, your file system must try to amortize the large cost of reading and/or writing one sector over several data and/or metadata accesses. It achieves this by loading the data from disk sectors into memory, and allowing several file system operations to access the memory copy rather than the data directly on disk. This would be rather pointless if the disk copy of the data needed to be updated every time the memory copy was updated. Therefore your file system must implement a write-behind policy for its buffer cache. This policy dictates that when the cached data (or metadata) is written, the corresponding disk data need not be immediately updated. The disk data is only updated when: The update policy for cached data is one parameter of caches. Another parameter of cache design is the replacement policy. The replacement policy dictates which cached blocks will be evicted when the cache is full and space is needed in the cache for newly needed blocks.

A reasonable replacement policy for the buffer cache is least-recently-used (LRU.) This policy dictates that the block which has been used least recently among those currently in the cache is chosen as the victim when a free block is needed. The LRU policy may be modified to prefer victim blocks that have not been modified since they were loaded into memory from disk, because this would avoid an extra disk write. LRU may be further modified advantageously to apply different treatment to different types of blocks. For instance, data and metadata often have different patterns of access, and may benefit from different LRU queues. More exotically, large files (such as those used by databases) are typically accessed in more random patterns than shorter files are (which are often accessed sequentially). This fact may or may not warrant a modification of the cache replacement policies. Your buffer cache should hold no more than 256 sectors. You may choose to divide this space among several separate pools for the different kinds of access (e.g, data, metadata, prefetch, and whatever other clever things you may dispose,) but the total number of sectors cached may not exceed 256.

File System Interface

Your file system must support the following interface. The system call numbers are defined in syscall_nums.h. Please continue to use the same system call convention (i.e. the same software interrupt, same registers, etc). As in project 3, argument structures have been defined for system calls that take more than one argument. They are provided in fs_types.h.

Using the Reference Kernel

We encourage you to use your kernel from project 3 during project 4. If for some reason you feel that is not possible, please meet with the course staff to discuss using a reference kernel.

Hints on Implementing a File System

Think about the data structures you will need before you start writing code. As was the case in the kernel project, writing pseudocode here will help you a lot. Go through each system call and sketch out the code. At the same time, you will be refining the data structures for your file system.

There are two main kinds of data structures required for the file system - those that reside on disk, and those that never get written to disk (buffer cache data structures, for example). It is best to start by planning the first type of data structures.

Metadata Design

Remember that:

These constraints will likely dictate that you have some of the following:

Think about what data structures you will need, and how the various file system calls will use them. Increasingly refine the system calls, looking for functionality that is needed from within several of them, e.g, finding the disk block that contains a given offset within a file, finding the nth directory entry within a directory, or finding an open file structure given a file descriptor. There are many such opportunities for sharing code among the system calls: write a legible and maintainable program by separating common functionality into procedures.

At this point, you will be done with design. You will have designed:

The rest of the recommended approach is listed in the "Plan of Attack" section.

A User Level Buffer Cache

There is a neat trick that you can use to simplify the debugging of your file system code. Simics uses raw files for its disk images. This means that you can create a version of your buffer cache that, rather than interacting with the hard disk device driver in Simics, simply seeks and reads/writes to a regular file in user space under Linux or Solaris. If you are careful in your design of the buffer cache and how the file system interacts with it, the changes required to create a user-level version of the buffer cache are minimal.

The hard disk image file is organized simply as an array of sectors. The first 512 bytes in the file correspond to the first sector, the second 512 bytes correspond to the second sector, etc.

It is not required that you create a user-level buffer cache, however doing so will allow you to also create a user-level tool for exploring the hard disk image outside of Simics and performing operations on it. Also, you may find that implementing a user-level buffer cache and single-process draft versions of the system calls may significantly speed up your project, as the debugging environment will probably be simpler.

Requests for Help

Please do not ask for help from the course staff with a message like this:

"I'm getting the OSKit default trap handler telling me I have a general protection fault. What's wrong?"

or

"My create system call crashes the kernel! What's wrong?"

The problem with both of these messages is that they do not follow the initial information-gathering steps which you must perform yourself. Debugging is actually one of the best ways to learn operating system concepts.

Having said that, if you have spent reasonable amount of time trying to solve the problem and are not making any progress, do not hesitate to ask a question. But when you do, provide evidence that you have spent some time trying to solve the problem yourself.

Plan of Attack

Like the kernel project, the file system is a great deal of work. Although it is not necessarily as conceptually challenging, it is probably more code. Do not be fooled into thinking you can waste two of your four weeks. It is critical for you to start the project right away if you wish to finish on time and do well.

To help you, we have established a recommended plan of attack. Hopefully this will give you an idea of how to start. This also establishes the criteria used for the checkpoints.

  1. Read this handout thoroughly.

  2. Re-read the complicated parts to make sure you understand what's being said.

  3. Think about it all for a while.

  4. Write up pseudocode for all the system calls. At the same time, start sketching out the layout of the disk and design your data structures. Use the previous hints section as a guide for this step.

  5. Work on the buffer cache. The buffer cache can be seen as a layer underneath the file system. The buffer cache need not understand much about the organization of the disk. Usage of the disk driver, as well as process blocking and rescheduling, can be completely encapsulated in the buffer cache to make the design cleaner and simpler.

  6. Test your buffer cache by simply reading and writing some blocks to the disk. At this point you are simply trying to verify that the blocks go where you think they are, and that the process management is working correctly.

  7. At this point you have enough to pass checkpoint one.

  8. Write mkfs and the helper functions needed by it. Test it. This should be made easier by using a user-level buffer cache (described in this document).

  9. Write more of the common helper functions. Choose the order of these carefully - it will be difficult to test some of them on their own.

  10. Complete the create, open, read, write, and close system calls. If you have done the previous step thoroughly (and tested the helper functions), this step is not as bad as it seems.

  11. You have enough to pass checkpoint 2.

  12. Implement the remaining system calls.

  13. Test your file system extensively.

  14. You are done with the 15-412 Project Set! Go celebrate, and then start studying for the final. :D

Steve Muckle
Last modified: Mon Mar 31 20:50:41 EST 2003