Elie Krevat

Department of Computer Science
Carnegie Mellon University
ekrevat at cs dot cmu dot edu

I'm currently working on the future of transportation with self-driving vehicles, developing distributed ML and deep learning systems for computer vision and autonomy. Previously, I was a Ph.D. student in computer science at Carnegie Mellon, researching many flavors of distributed systems, analytical modeling, applied machine learning, and large-scale data analysis. As a graduate research assistant in the Parallel Data Lab I was advised by Greg Ganger.

Research

My research interests include combining ML and systems techniques to create smarter, automated, and reactive systems. I'm excited about building tools that surface and learn from complex system relationships through large-scale data analysis. A great application of this is building out the autonomy platform for self-driving cars and trucks.

Distributed ML training and autonomy pipelines for self-driving vehicles

There are many unique challenges when it comes to processing a huge and rich corpus of logs to extract features and develop ML models for self-driving vehicles. It requires an efficient means for distributed training and scoring of models while analyzing their performance metrics on and off vehicle. This involves a combination of computer vision applications, classical ML methods, and deep learning.

Automated analysis and mitigation of performance problems in service-oriented architectures

Responding to resource-sensitive performance problems is becoming increasingly difficult for system administrators, and increasingly expensive because of unnecessary overprovisioning, as distributed and cloud computing applications are built across larger numbers of interconnected shared services. Performance problems arise from many sources, such as service upgrades, configuration errors, and the continuous flow of changing user requests that take different paths through the system. Unfortunately, root-cause diagnosis can take hours or days to isolate and fix a problem, even for system experts.

My dissertation proposes an automated approach to analyze and mitigate performance problems through the reactive provisioning of machines. This "quick fix" leverages end-to-end flow analysis to determine the critical path of requests, to help classify performance issues, and to direct an efficient allocation of resources to the services that affect client-perceived delays. These automated tools surface and learn from complex data relationships; they apply ML, data mining, and graphical and statistical analyses to predict and measure corrective actions. In many cases, problems can be mitigated within a few minutes of detection, returning client performance to acceptable levels and allowing any other problem diagnosis efforts to continue unconstrained.
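As a rough illustration of what "critical path of a request" means, the sketch below treats one request as a tree of service calls and finds the slowest root-to-leaf chain. This is not the dissertation's tooling: the function name, the service names, and the timings are all hypothetical, and the model assumes each service does its own work and then calls its children in parallel.

```python
from functools import lru_cache

def critical_path(durations, children, root):
    """Return (total_time, path) for the slowest chain of calls from root.

    durations: service name -> time spent in that service itself (ms)
    children:  service name -> services it calls in parallel afterwards
    Both mappings are hypothetical inputs, e.g. from end-to-end traces.
    """
    @lru_cache(maxsize=None)
    def best(node):
        # Slowest sub-chain among the parallel children (empty for a leaf).
        sub = [best(child) for child in children.get(node, ())]
        tail_cost, tail_path = max(sub, default=(0.0, ()))
        return durations[node] + tail_cost, (node,) + tail_path

    return best(root)

# Hypothetical request: frontend calls auth and cache; auth calls db.
durations = {"frontend": 5, "auth": 10, "db": 40, "cache": 2}
children = {"frontend": ["auth", "cache"], "auth": ["db"]}

total_ms, path = critical_path(durations, children, "frontend")
print(total_ms, path)
```

In this toy request the chain through auth and db dominates the client-perceived delay, so extra resources for the cache service would not help; that is the kind of decision a critical-path view is meant to inform.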

Seeking Efficient Data-Intensive Computing

New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how these systems perform relative to the capabilities of the hardware on which they run. We developed simple models of I/O resource consumption and applied them to a map-reduce workload to produce ideal lower bounds on runtimes, exposing the inefficiency of popular scale-out systems. Using a simplified dataflow processing tool called Parallel DataSeries (PDS), we also demonstrated that the model's ideal can be approached within 20%, and explored the reasons for the gap between ideal and actual performance that are faced by any DISC system built atop standard OS and networking services. We found that disk stragglers and network slowdown effects are the primary culprits for lost efficiency.
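To make the idea of an I/O-based lower bound concrete, here is a minimal sketch of such a model. It is an illustration under stated assumptions, not the model from the publications: it assumes a two-phase job with a uniform all-to-all shuffle, intermediate data that fits in memory, and perfectly overlapped disk and network I/O within a phase. The class, function, and parameter names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    disk_read_mb_s: float   # sequential disk read bandwidth
    disk_write_mb_s: float  # sequential disk write bandwidth
    network_mb_s: float     # usable network bandwidth per node

def ideal_runtime_s(input_mb, n_nodes, node, shuffle_ratio=1.0, output_ratio=1.0):
    """Lower bound on runtime if every node keeps its bottleneck device busy."""
    per_node_in = input_mb / n_nodes
    # Phase 1: read the input while shuffling map output. With a uniform
    # shuffle, (n-1)/n of each node's map output crosses the network.
    sent = per_node_in * shuffle_ratio * (n_nodes - 1) / n_nodes
    phase1 = max(per_node_in / node.disk_read_mb_s, sent / node.network_mb_s)
    # Phase 2: write the final output to local disk.
    phase2 = per_node_in * output_ratio / node.disk_write_mb_s
    return phase1 + phase2

def slowdown(actual_s, ideal_s):
    """How many times slower a measured run is than the ideal bound."""
    return actual_s / ideal_s

node = Node(disk_read_mb_s=100, disk_write_mb_s=100, network_mb_s=100)
ideal = ideal_runtime_s(input_mb=10_000, n_nodes=10, node=node)
print(ideal, slowdown(24.0, ideal))
```

With these illustrative numbers the bound is 20 seconds, so a hypothetical measured run of 24 seconds would sit 20% above the ideal.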

Incast: TCP Throughput Collapse in Cluster-based Storage Systems

Building cluster-based storage systems using commodity TCP/IP and Ethernet networks is attractive because of their low cost, ease of use, and the desire to combine routing infrastructures for LAN, SAN, and high-performance computing. Yet an important barrier to their use is the TCP Incast problem, where bursty traffic from synchronized reads in cluster-based storage systems produces a TCP throughput collapse of one to two orders of magnitude. We have studied the network conditions that cause this TCP throughput collapse in both simulation and real-world deployments and examined the effectiveness of TCP- and Ethernet-level solutions. With our latest publication, we have found reasonable solutions to the problem using high-resolution timers that implement a microsecond-granularity TCP retransmission timeout. This solution is both feasible and practical for fast storage networks while also safe for wide area networks, revisiting an older assumption on spurious TCP retransmissions that no longer holds true.
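A back-of-the-envelope calculation shows why the retransmission timeout matters so much here. The sketch below is a deliberately simplified model, not a result from the papers: it assumes a synchronized block read stalls for exactly one timeout while the link sits idle, and it uses illustrative values (a 1 MB block, a 1 Gbps link, a 200 ms timeout as a conventional minimum versus a hypothetical 200 µs one).

```python
def goodput_mbps(block_bytes, link_mbps, rto_s, timeouts=1):
    """Effective throughput of one block transfer that stalls for `timeouts` RTOs."""
    bits = block_bytes * 8
    transfer_s = bits / (link_mbps * 1e6)      # time on the wire at line rate
    total_s = transfer_s + timeouts * rto_s    # link is idle during each timeout
    return bits / total_s / 1e6

block = 1_000_000  # 1 MB block, illustrative
coarse = goodput_mbps(block, link_mbps=1000, rto_s=200e-3)  # millisecond-scale RTO
fine = goodput_mbps(block, link_mbps=1000, rto_s=200e-6)    # microsecond-scale RTO
print(round(coarse, 1), round(fine, 1))
```

Under these assumptions a single coarse timeout dwarfs the 8 ms transfer time and cuts goodput by more than an order of magnitude, while a microsecond-scale timeout keeps it close to line rate.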

Publications

SpringFS: Bridging Agility and Performance in Elastic Distributed Storage.

Lianghong Xu, James Cipar, Elie Krevat, Alexey Tumanov, Nitin Gupta, Michael A. Kozuch, Gregory R. Ganger.

In 12th USENIX Conference on File and Storage Technologies (FAST 2014).

February 2014.

[pdf]
An Automated Approach for Mitigating Service Performance Problems with Efficient Resource Allocations.

Elie Krevat.

Carnegie Mellon University Ph.D. Thesis Proposal.

August 2013.

[pdf]
JackRabbit: Improved Agility in Elastic Distributed Storage.

James Cipar, Lianghong Xu, Elie Krevat, Alexey Tumanov, Nitin Gupta, Michael A. Kozuch, Gregory R. Ganger.

Carnegie Mellon University Technical Report.

October 2012.

[pdf]
Understanding Inefficiencies in Data-Intensive Computing.

Elie Krevat, Tomer Shiran, Eric Anderson, Joseph Tucek, Jay J. Wylie, and Gregory R. Ganger.

Carnegie Mellon University Technical Report.

January 2012.

[pdf]
Applying Idealized Lower-bound Runtime Models to Understand Inefficiencies in Data-intensive Computing (Extended Abstract).

Elie Krevat, Tomer Shiran, Eric Anderson, Joseph Tucek, Jay J. Wylie, and Gregory R. Ganger.

In ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems.

June 2011.

[pdf]
Disks Are Like Snowflakes: No Two Are Alike.

Elie Krevat, Joseph Tucek, and Gregory R. Ganger.

In 13th Workshop on Hot Topics in Operating Systems (HotOS 2011).

May 2011.

[pdf]
Diagnosing Performance Changes by Comparing Request Flows.

Raja Sambasivan, Alice Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, and Gregory Ganger.

In Proceedings of 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2011).

March 2011.

[pdf]
Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication.

Vijay Vasudevan, Amar Phanishayee, Hiral Shah, Elie Krevat, David Andersen, Gregory Ganger, Garth Gibson, and Brian Mueller.

ACM SIGCOMM.

August 2009.

[pdf]
Tashi: Location-aware Cluster Management.

Michael Kozuch, Michael Ryan, Richard Gass, Steven Schlosser, David O’Hallaron, James Cipar, Elie Krevat, Julio López, Michael Stroucken, and Gregory Ganger.

In Proceedings of First Workshop on Automated Control for Datacenters and Clouds.

June 2009.

[pdf]
Measurement and Analysis of TCP Throughput Collapse in Cluster-Based Storage Systems.

Amar Phanishayee, Elie Krevat, Vijay Vasudevan, David Andersen, Gregory Ganger, Garth Gibson, and Srinivasan Seshan.

In Proceedings of File and Storage Technologies (FAST 2008).

February 2008.

[pdf]
On Application-level Approaches to Avoiding TCP Throughput Collapse in Cluster-based Storage Systems.

Elie Krevat, Vijay Vasudevan, Amar Phanishayee, David Andersen, Gregory Ganger, Garth Gibson, and Srinivasan Seshan.

In Proceedings of Petascale Data Storage Workshop (PDSW at Supercomputing 2007).

November 2007.

[pdf] [ppt]
Scheduling Algorithms to Improve Utilization in Toroidal-Interconnected Systems.

Elie Krevat.

MIT Master of Engineering Thesis.

May 2003.

[pdf]
An Overview of the BlueGene/L Supercomputer.

NR Adiga et al. (large author list).

In ACM/IEEE conference on Supercomputing.

November 2002.

[pdf]
Job Scheduling for the BlueGene/L System.

Elie Krevat, Jose G. Castanos, and Jose E. Moreira.

In Job Scheduling Strategies for Parallel Processing, 8th International Workshop (JSSPP 2002).

July 2002.

[pdf]

Teaching

At CMU I TAed 15-712: Advanced Operating Systems and Distributed Systems with Dave Andersen and 15-213: Introduction to Computer Systems with Greg Ganger and Randy Bryant.

At MIT I TAed 6.033: Computer System Engineering with Frans Kaashoek while earning my M.Eng. degree.

Background

Before CMU, I completed a B.S. and M.Eng. in computer science at MIT, with a minor in economics. My master's thesis included work from a few summers and a semester of research at IBM T.J. Watson Research Center on system software for the Blue Gene supercomputer. I also spent a few years honing my product development experience at Microsoft as a software design engineer, where I played around with pre-alpha Windows technologies and developed the first two versions of Office Accounting Professional, a stand-alone product and third-party development platform for small businesses.