15-749: Engineering Distributed Systems
Syllabus and Course Organization
Overview and Goals
The goal of this course is to give students the knowledge and skills
required to create and evolve the complex, large-scale distributed
systems that society will increasingly depend on in the future.
The course will teach the organizing principles of such systems,
identifying a core set of versatile techniques that are applicable
across many system layers and have remained invariant across enormous
technological change. Students will acquire the knowledge base,
intellectual tools, hands-on skills and modes of thought needed to
build well-engineered distributed systems that withstand the test of
time, growth in scale, and stresses of live use. Strong design
and implementation skills are expected of all students.
Although designated as a graduate level course, it may be taken by
well-prepared undergraduates with excellent design and implementation
skills in low-level systems programming. The course assumes a high
level of proficiency in all aspects of operating system design and
implementation. Our goal is to impart both knowledge and
skills. We want you to acquire a deep understanding of the
engineering principles involved in designing and implementing
mission-critical software. We also want you to be able to
translate these principles into working systems. The course will
consist of both lectures, where concepts will be discussed, and a lab,
where hands-on experience will be gained. We hope that you find
this an exciting, stimulating, and fun course in which you gain
valuable knowledge and skills that will serve you well throughout your
career.
Course Topics
Caching for performance and availability
- origins of temporal and spatial locality, historical background
- versatility: hardware, local and distributed file systems, web, database
- design dimensions: level, granularity, persistence, timeliness (write-back/write-through, trickle reintegration), update safety
- challenges: maintaining coherence at large scale, rapid cache
validation, failure resiliency of disconnected updates, transparency
vs. translucency
Prefetching for performance and availability
- transforming high bandwidth into low latency (probabilistically)
- versatility: OS read-ahead, chunking or whole-file transfer, virtual memory replacement policy, hoarding, data staging
- dimensions: level of application, amount of advance notice, cache advantage, cost of mispredictions
- source of hints: explicit user advice, compiler directives, runtime observation
- challenges: interference with demand misses, mispredictions, resource hogging
Content-Addressable Storage and Deduplication for performance and storage efficiency
- concept of cryptographic hashes and challenges: collision resistance, computing overhead of hashes
- versatile applications of CAS: rsync, LBFS, lookaside caching, Pastiche, Centera, DHTs, Cedar
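As a rough illustration of the content-addressable storage idea listed above, the following minimal sketch keys each chunk by a cryptographic hash of its bytes, so identical chunks are stored only once. It is not taken from any of the systems named above. The class and function names (ContentAddressableStore, store_file, load_file), the choice of SHA-256, and the tiny fixed chunk size are all assumptions made for illustration.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory store (hypothetical): a chunk's key is the SHA-256 of its bytes."""

    def __init__(self):
        self._chunks = {}  # hex digest -> chunk bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical content always hashes to the same key, so it is kept once.
        self._chunks.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._chunks[key]

    def __len__(self) -> int:
        # Number of distinct chunks actually stored.
        return len(self._chunks)


def store_file(store: ContentAddressableStore, data: bytes, chunk_size: int = 4) -> list:
    """Split data into fixed-size chunks (an assumption) and return the list of chunk keys."""
    return [store.put(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]


def load_file(store: ContentAddressableStore, recipe: list) -> bytes:
    """Rebuild the original bytes from the ordered list of chunk keys."""
    return b"".join(store.get(key) for key in recipe)
```

Storing b"abcdabcdabcdxyz" with the default chunk size yields a list of four keys but only two stored chunks, since the three "abcd" chunks share one key. The sketch relies on collision resistance (two different chunks are assumed never to share a digest), and every put pays the cost of computing a hash.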
Damage containment for reliability
- origin of failures in distributed systems
- fail-stop and Byzantine failures
- failure detection and isolation: parallels with non-CS techniques
- atomic transactions: concepts and implementation, recoverable virtual memory
- distributed transactions and two-phase commit
- implementation challenges and tradeoffs: cost of boundary
crossings (e.g., read() vs. mapped I/O, cost of marshalling and
unmarshalling arguments, threads vs. tasks), time and space cost of
integrity checks, logging and state restoration, commit delay versus
timeliness
Replication for availability
- replica control: pessimistic and optimistic replica control, voting and quorum-based strategies
- challenges: update propagation, interpreting silence
- cache-based disconnected operation; first and second-class replicas
- update conflict detection and resolution
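The quorum-based strategies mentioned above can be illustrated with a small check of the standard overlap conditions. This is a minimal sketch that assumes one vote per replica. The function name quorums_are_safe and its parameters n (replicas), r (read quorum size) and w (write quorum size) are hypothetical names chosen for illustration.

```python
def quorums_are_safe(n: int, r: int, w: int) -> bool:
    """Check the usual quorum overlap conditions for n replicas with one vote each.

    r + w > n : every read quorum intersects every write quorum,
                so a read always reaches at least one up-to-date replica.
    2 * w > n : any two write quorums intersect,
                so two writes cannot both succeed on disjoint replica sets.
    """
    if not (1 <= r <= n and 1 <= w <= n):
        return False
    read_sees_latest_write = r + w > n
    writes_are_serialized = 2 * w > n
    return read_sees_latest_write and writes_are_serialized
```

For example, quorums_are_safe(3, 2, 2) holds, while quorums_are_safe(3, 1, 1) does not. A read-one/write-all configuration such as quorums_are_safe(5, 1, 5) also satisfies both conditions, at the cost of blocking writes whenever any single replica is unreachable.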
Challenges of longevity and scale
- centralization and decentralization: economies and diseconomies of scale
- apparent scale reduction for performance and manageability
- abstraction and hierarchy: nearly decomposable
systems, hierarchical naming, independent/autonomous name
assignment, AFS and Coda volumes, rapid cache validation,
wholesale and retail resource allocation
- challenges: name space fragmentation, internal fragmentation,
importance of contiguity, non-CS metaphors
- use of hints for performance and availability
Designing to Anticipate Human Foibles
- limitations of individuals: attention span, reaction time,
interactive performance and system latency, behavior under stress,
cognitive and perceptual limits, trust limits, limits of working memory
and long-term recall
- limitations of large groups: evils of anonymity, spam, junk mail and postage, tragedy of the commons
Hands-on Projects
A series of design and implementation projects is an integral part of
the course. These are substantial projects that embody concepts
taught in the lectures. The projects are done individually (i.e.,
not in groups).
We will loan you a laptop for the duration of the course. It
will run a minimal Linux system directly on bare metal.
You thus have 24x7 access to the lab hardware for this course, and
don't need to compete with other students/courses for shared
access. You have complete control over the machine including root
access.
We will also loan you an Android smartphone for Project 4.
Please treat the laptops and smartphones as if they were your own, and
return them in good condition at the end of the course so that they can
be reused.
Resources
There is no textbook. Slides used in class will be placed in AFS.
Course AFS area: /afs/andrew.cmu.edu/course/15/749
Remember that you need to be authenticated to the andrew.cmu.edu realm to access this material.
Required and optional readings are available through the course web site.
Course mailing list: 15-749-everyone@lists.andrew.cmu.edu
Messages to this list reach all students and instructors. The
instructors will use it for announcements. It can also be used for class
discussions and questions.
Faculty
Mahadev Satyanarayanan (Satya)
Contact information: GHC 9123, x8-3743, satya@cs.cmu.edu
Admin assistant: Angela Miller (GHC 9129, x8-6645, amiller@cs.cmu.edu)
Lab Instructor
Jan Harkes
Contact information: GHC 9127, x8-6658, jaharkes@cs.cmu.edu
Classes
Lecture/Lab: Mondays, Wednesdays and Fridays from January 13 to May 2
Time: 12:00 - 13:20
Place: GHC 4301
No class on:
- Monday January 20 (Martin Luther King Day)
- Friday March 7 (Mid-semester break)
- Monday-Friday March 10-14 (Spring Break)
- Friday April 11 (Spring Carnival)
Exams
- Mid-term Exam: Wednesday, March 5 (in class)
- Final Exam: TBA (date and room)
Evaluation
A large part of your grade in the course will be based on the
projects. There will also be a mid-term exam and a final exam,
based on the material presented in the lectures and the required
readings. A small part of your grade will be based on your active
engagement and participation in class (both lectures and lab).
The grade components will be weighted as follows:
- Projects: 50%
- Mid-term exam: 20%
- Final exam: 25%
- Class participation: 5%
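To show how the weights above combine, here is a minimal sketch of the arithmetic. It assumes that each component is scored on a 0-100 scale, which this syllabus does not specify. The names WEIGHTS_PERCENT and course_score are hypothetical, and the sketch does not describe how letter grades are assigned.

```python
# Weights from the list above, in percent. The component keys are illustrative names.
WEIGHTS_PERCENT = {
    "projects": 50,
    "midterm": 20,
    "final": 25,
    "participation": 5,
}


def course_score(scores: dict) -> float:
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    if sum(WEIGHTS_PERCENT.values()) != 100:
        raise ValueError("weights must sum to 100 percent")
    total = sum(WEIGHTS_PERCENT[name] * scores[name] for name in WEIGHTS_PERCENT)
    return total / 100
```

For instance, component scores of 90 (projects), 80 (mid-term), 70 (final) and 100 (participation) combine as (50*90 + 20*80 + 25*70 + 5*100) / 100 = 83.5.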
Last edited by Satya (02/14/2014)