Clean Slate Architectures for Network Management

While the Internet Protocol (IP) has been a runaway success, today's IP networks are difficult to manage well. We take a clean slate approach to redesigning different aspects of network control and management, guided by the following three principles:

Network-level objectives: Running a robust data network depends on satisfying objectives for performance, reliability, and policy that can (and should) be expressed as goals for the entire network, separately from the low-level network elements.

Network-wide views: Timely, accurate, network-wide views of topology, traffic, and events are crucial for running a robust network.

Direct control: The decision logic should provide network operators with a direct interface to configure network elements; this logic should not be implicitly or explicitly hardwired in protocols distributed among switches.

These design principles have been embodied in three research initiatives:

An architecture for centralizing network decision logic
The theory and practice of interconnecting multiple routing instances
The design of new flow monitoring solutions

The 4D Architecture

Layers of the 4D architecture

Despite the early design goal of minimizing the state in network elements, tremendous amounts of state are distributed across routers and management platforms in IP networks. We believe that the many loosely coordinated actors that create and manipulate this distributed state introduce substantial complexity, which makes both backbone and enterprise networks increasingly fragile and difficult to manage. In the 4D architecture, we decompose the functions of network control into four planes: a decision plane that is responsible for creating a network configuration (e.g., computing FIBs for each router in the network); a dissemination plane that delivers information about network state (e.g., link up/down information) to the decision plane and distributes decision plane output to routers; a discovery plane that enables devices to discover their directly connected neighbors; and a data plane for forwarding network traffic.
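To make the division of labor concrete, the following minimal Python sketch shows what a decision plane computation could look like: it takes a network-wide link view (of the kind the dissemination plane would deliver) and computes a next-hop table for every router. The function names, the link-cost input format, and the choice of shortest-path routing are assumptions made for illustration; they are not taken from the 4D papers or from Tesseract.

```python
import heapq

def shortest_path_fib(adj, src):
    """Run Dijkstra from src and record the first hop used to reach each destination."""
    best = {src: 0}
    fib = {}
    done = set()
    heap = [(0, src, src)]  # (path cost, node, first hop on the path from src)
    while heap:
        cost, node, hop = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node != src:
            fib[node] = hop
        for nbr, weight in adj[node].items():
            new_cost = cost + weight
            if nbr not in done and new_cost < best.get(nbr, float("inf")):
                best[nbr] = new_cost
                # Neighbors of src are their own first hop; others inherit it.
                heapq.heappush(heap, (new_cost, nbr, nbr if node == src else hop))
    return fib

def compute_fibs(links):
    """Hypothetical decision-plane step: network-wide view in, one FIB per router out.

    links maps (router_a, router_b) -> cost and is treated as bidirectional.
    """
    adj = {}
    for (a, b), cost in links.items():
        adj.setdefault(a, {})[b] = cost
        adj.setdefault(b, {})[a] = cost
    return {router: shortest_path_fib(adj, router) for router in adj}

# Example view: three routers, with a cheap path A-B-C and an expensive direct link A-C.
view = {("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 5}
fibs = compute_fibs(view)
```

In this sketch a link failure is handled by removing the link from the view and recomputing all tables in one place, rather than by each router reacting independently.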

Publications

Tesseract: A 4D Network Control Plane
Hong Yan, David A. Maltz, T. S. Eugene Ng, Hemant Gogineni, Hui Zhang, Zheng Cai.
Proceedings of USENIX Symposium on Networked Systems Design and Implementation (NSDI '07), April 2007.
A Clean Slate 4D Approach to Network Control and Management
Albert Greenberg, Gisli Hjalmtysson, David A. Maltz, Andy Myers, Jennifer Rexford, Geoffrey Xie, Hong Yan, Jibin Zhan, Hui Zhang.
ACM SIGCOMM Computer Communication Review. 35(5). October, 2005.
Network-Wide Decision Making: Toward A Wafer-Thin Control Plane
Jennifer Rexford, Albert Greenberg, Gisli Hjalmtysson, David A. Maltz, Andy Myers, Geoffrey Xie, Jibin Zhan, and Hui Zhang.
Proceedings of HotNets III. November, 2004.
Refactoring Network Control and Management: A Case for the 4D Architecture
Albert Greenberg, Gisli Hjalmtysson, David A. Maltz, Andy Myers, Jennifer Rexford, Geoffrey Xie, Hong Yan, Jibin Zhan, Hui Zhang.
CMU CS Technical Report CMU-CS-05-117, September 2005.
On Static Reachability Analysis of IP Networks
Geoffrey Xie, Jibin Zhan, Dave Maltz, Hui Zhang, Albert Greenberg, Gisli Hjalmtysson, Jennifer Rexford.
Proceedings of IEEE Infocom 2005.
Routing Design in Operational Networks: A Look from the Inside
D. Maltz, G. Xie, J. Zhan, H. Zhang, A. Greenberg, G. Hjalmtysson.
Proceedings of ACM SIGCOMM 2004.

Theory and Practice of Interconnecting Multiple Routing Instances

Today, a large body of research exists on the correctness of existing routing protocols. However, analytical frameworks for studying routing dynamics have mostly focused on a single routing protocol instance at a time. In reality, the Internet is composed not of one protocol instance (e.g., BGP) but of a multitude of instances that need to interact. For example, routes must be exchanged between BGP and OSPF. The interactions between these protocol instances are governed by the routing glue component. Despite its wide usage and essential role, there has been no formal investigation into how safe its usage is. We develop analytical models to rigorously analyze the interactions between multiple routing protocol instances and their impact at a network-wide level. We show that making individual routing protocols safe is not sufficient to ensure the correctness of Internet routing; the routing glue plays an equally important part. Its usage can result in a wide range of routing anomalies, including persistent forwarding loops and permanent route oscillations. This routing glue deserves further attention from the networking community.

Publications:

Instability Free Routing: Beyond One Protocol Instance
Franck Le, Geoffrey Xie, Hui Zhang.
Proceedings of ACM CoNEXT '08, December 2008.
Brief description

The interactions between routing protocol instances are in fact governed by two procedures: route redistribution permits the exchange of routing information, and route selection allows routers to rank routes received from different instances. We demonstrate that the problem is broader than that of route redistribution alone. Route selection by itself (i.e., the mere co-existence of multiple routing protocol instances) and its interplay with route redistribution can each result in routing anomalies. We show that the routing glue could actually be at the origin of many global disruptions of Internet connectivity that have been reported but could not be fully explained so far.
Shedding Light on the Glue Logic of the Internet Routing Architecture (Slides)
Franck Le, Geoffrey Xie, Dan Pei, Jia Wang, Hui Zhang.
Proceedings of ACM SIGCOMM '08, August 2008.
Brief description

We conduct a large-scale empirical study of the prevalence and usage of the routing glue in more than 1600 operational networks. The evidence shows that the routing glue is widely deployed. More surprisingly, we discover that operators depend on the routing glue not simply to interconnect routing instances but also to implement complex design objectives that existing routing protocols (e.g., BGP) alone cannot accomplish. This reinforces the importance of the role played by the routing glue. Finally, we find that actual deployed configurations can be vulnerable to routing anomalies. These results confirm the importance of the problem.
Understanding Route Redistribution (Slides)
Franck Le, Geoffrey Xie, Hui Zhang.
Proceedings of IEEE ICNP '07, October 2007.
Best Paper Award.
Brief description

We develop an analytical model to rigorously analyze the impacts of route redistribution, i.e., the exchange of routing information between different routing protocol instances, on a network-wide level. We illustrate how easily inaccurate configurations of route redistribution may cause severe routing instabilities (including route oscillations and persistent routing loops) and we discuss potential changes to the current route redistribution procedure to guarantee safety.
On Guidelines for Safe Route Redistributions
Franck Le, Geoffrey Xie.
Proceedings of ACM SIGCOMM Workshop on Internet Network Management (INM'07), August 2007.
Brief description

We show that existing recommendations put forth by router vendors do not effectively protect against routing anomalies. Configurations of route redistribution, compliant with existing guidelines, can still experience permanent route oscillations and other unacceptable instabilities. Consequently, we propose a set of new configuration guidelines for different targeted objectives. The configuration guidelines consist of sufficient conditions for the usage of route redistribution, and we formally prove that each guideline will prevent the targeted routing anomaly.
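The two glue procedures discussed in this section, route selection and route redistribution, can be illustrated with a small Python sketch. It assumes a per-protocol preference table modeled on common vendor default administrative distances (lower wins); the `Route` class and both function names are hypothetical and do not reproduce the models in the papers above.

```python
from dataclasses import dataclass

# Assumed preference values, modeled on common vendor defaults (lower is preferred).
ADMIN_DISTANCE = {"ebgp": 20, "ospf": 110, "rip": 120}

@dataclass(frozen=True)
class Route:
    prefix: str
    protocol: str   # routing instance that announced the route
    next_hop: str

def select_route(candidates):
    """Route selection: rank routes to one prefix across instances by preference."""
    return min(candidates, key=lambda r: ADMIN_DISTANCE[r.protocol])

def redistribute(route, target_protocol, border_router):
    """Route redistribution: re-announce a route into another instance.

    The copy carries the target protocol's label, so the original protocol
    (and its metric) is no longer visible to routers that receive it.
    """
    return Route(route.prefix, target_protocol, border_router)

# A prefix learned natively via RIP, and the same prefix after a border
# router X redistributes it into OSPF.
native = Route("10.0.0.0/8", "rip", "R1")
copied = redistribute(native, "ospf", "X")
chosen = select_route([native, copied])
```

In this sketch a router that hears both announcements prefers the redistributed copy purely because of the protocol label, even though the copy is derived from the native route. This loss of information across the boundary is the kind of interplay that the analyses above examine.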

Technical Reports:

Theory and New Primitives for Interconnecting Routing Protocol Instances
Franck Le, Geoffrey Xie, Hui Zhang.
Computer Science Technical Report CMU-CS-09-132, May 2009.

Rethinking Flow Monitoring: A Coordinated RISC Architecture for Network Flow Monitoring


RISC vs. application-specific approaches
Example of a network-wide RISC approach

Flow monitoring supports several critical network management tasks such as traffic engineering, accounting, anomaly detection, identifying and understanding end-user applications, understanding traffic structure at various granularities, detecting worms, scans, and botnet activities, and forensic analysis. These require high-fidelity estimates of traffic metrics relevant to each application. The set of network management and security applications is a moving target, and new applications arise as the nature of both normal and anomalous traffic patterns changes over time. We make the case for a "RISC" approach to flow monitoring that employs simple collection primitives on each monitoring device and manages them in an intelligent network-wide fashion, to ensure that the collected data will support computation of metrics of interest to various applications. A RISC architecture dramatically reduces the implementation complexity of monitoring elements; enables router vendors and researchers to focus their energies on efficiently implementing a small number of primitives; and allows late binding to what traffic metrics are important, thus insulating router implementations from the changing needs of flow monitoring applications.

Presentation

Rethinking NetFlow

Publications

A Case for a RISC Architecture for Network Flow Monitoring
Vyas Sekar, Michael K. Reiter, Hui Zhang.
CMU CS Technical Report CMU-CS-09-125
Brief description

This paper addresses the question of whether we need complex application-specific primitives to meet the demands of different flow monitoring applications or if it suffices to implement a small number of "RISC" primitives on routers to get sufficient fidelity across the entire spectrum of applications.
Coordinated Sampling sans Origin-Destination Identifiers: Algorithms, Analysis, and Evaluation
Vyas Sekar, Anupam Gupta, Michael K. Reiter, Hui Zhang.
CMU CS Technical Report CMU-CS-09-104
Brief description

This paper describes how to implement cSamp using only local information on routers without requiring global OD-pair identifiers. It provides an immediate and incremental deployment path for ISPs without requiring changes to packet headers or the existing routing infrastructure.
cSamp: A System for Network-Wide Flow Monitoring
Vyas Sekar, Michael K. Reiter, Walter Willinger, Hui Zhang, Ramana Rao Kompella, David G. Andersen
Proceedings of NSDI 2008
Brief description

This paper describes the basic Coordinated Sampling framework. There are three key ideas: flow sampling, hash-based coordination, and an optimization framework for meeting network-wide flow monitoring objectives while operating within router resource constraints.
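The hash-based coordination idea can be sketched in a few lines of Python. Routers along a path are given non-overlapping slices of the hash space, and each router logs a flow only if the hash of the flow's 5-tuple falls in its own slice, so no flow is recorded twice. The function names, the use of SHA-256, and the hand-picked fractions below are assumptions for illustration; in cSamp the slices come out of the optimization framework, which is not modeled here.

```python
import hashlib

def flow_hash(five_tuple):
    """Map a flow 5-tuple to a value in [0, 1) using the first 48 bits of SHA-256."""
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:6], "big") / 2**48

def assign_ranges(path, fractions):
    """Give each router on a path a non-overlapping slice of [0, 1).

    fractions are assumed inputs (how much of the path's traffic each router
    should sample); they stand in for the output of an optimization step.
    """
    ranges = {}
    start = 0.0
    for router, fraction in zip(path, fractions):
        ranges[router] = (start, start + fraction)
        start += fraction
    return ranges

def should_log(router, five_tuple, ranges):
    """A router logs the flow only if the flow's hash lands in its own slice."""
    low, high = ranges.get(router, (0.0, 0.0))
    return low <= flow_hash(five_tuple) < high

# Three routers share responsibility for one path.
ranges = assign_ranges(["R1", "R2", "R3"], [0.5, 0.25, 0.25])
flow = ("10.0.0.1", "10.0.0.2", 12345, 80, "tcp")
loggers = [r for r in ranges if should_log(r, flow, ranges)]
```

Because every router applies the same hash function to the same packet fields, the decision needs no per-flow communication between routers. When the fractions sum to less than 1, some flows are simply not sampled on that path.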