TL;DR
- Yes, it is possible to build a new DBMS in academia, but it is still hard.
- We found that using Postgres as a starting point was too slow for in-memory OLTP.
- The challenges in academia are slightly different than the ones at a start-up.
I've been on my grind the last couple of months. We had a successful demo of our system at CIDR in January. Somebody started a feud with me. I'm looking past that and trying to maintain my flow.
Like other schools with large CS programs, CMU has a long tradition of big software system projects that go on to have a life outside of the university (e.g., Mach, AFS). To the best of my knowledge, there has not been a full-featured DBMS that has come out of CMU[1]. The closest would be Natassa Ailamaki's Shore-MT project from when she was at CMU, but it is only an execution engine and storage manager (think InnoDB).
Although I am not ready to officially "announce" our DBMS yet, I want to talk about what my group has been working on for the last two years. It is also how I plan to spend the next five years[2] of my life: building this system. The problem with building an open-source DBMS is that the bar is high because there are already great existing systems (e.g., MySQL, Postgres). I want to avoid the harsh initial reactions that come from making grandiose claims about its abilities. It is going to take a while (i.e., years) to get the research right for the system's self-driving components.
Strategies for Starting a New DBMS Project
Few take on the task of writing a DBMS from scratch. Part of the reason is that there is just so much infrastructure that you need to have in order to create a system that is usable. Such components include a SQL parser, query optimizer, system catalogs, configuration files, networking support, and expression + type systems. This is even before you get to the other hard parts like concurrency control, query execution, and storage/recovery. I refer to all of the items in the first list as the "front-end" of a system. This is not sexy code to write. This is why most new DBMSs start by building on Postgres[3], either by forking it (e.g., Greenplum, Vertica, PipelineDB) or by using its extension/hook APIs (e.g., CitusDB, Vitesse, TimescaleDB). One could also try to create a new storage engine for MySQL, but these projects are notoriously never successful[4] (see Falcon, DeepSQL, Akiban, InfiniDB). Another advantage of basing your new system off of an existing one is that you get to retain compatibility with some of the existing tools in a DBMS's ecosystem.
When we started the Peloton project we decided to fork Postgres and then cut out the parts that we wanted to rewrite. Postgres' code is beautiful. It's well-documented. It's portable. It's a textbook implementation of a relational DBMS. But it is a bit dated and the overall architecture has some issues. The first problem that we encountered was that we had to convert it from ANSI-95 C to modern C++11 to make it work with our new storage manager. My PhD student Joy Arulraj did this with some summer interns in about a month (see his C++ Postgres fork on GitHub). We then spent another month converting its runtime architecture from a multi-process, shared-memory model to a single-process, multi-threaded model. We deemed this necessary to support better single-node scale-up now and to eventually go distributed in the future. One surprising thing that we found was that Postgres' WIN32 code was easier to convert to pthreads than the Linux-specific portions of the code.
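To make the difference between the two runtime models concrete, here is a minimal sketch (my own illustration, not Peloton's actual code) of the single-process, multi-threaded approach: every client connection is handled by a thread inside one address space, instead of fork()'ing a separate backend process per connection and coordinating through shared-memory segments as stock Postgres does. The handler name and port number are placeholders.

```cpp
// Sketch of a single-process, multi-threaded connection model (illustrative only).
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <thread>

// Hypothetical per-connection handler: in a real DBMS this would parse,
// plan, and execute the client's queries.
void HandleConnection(int client_fd) {
  char buf[4096];
  ssize_t n;
  while ((n = read(client_fd, buf, sizeof(buf))) > 0) {
    write(client_fd, buf, n);  // placeholder: echo the request back
  }
  close(client_fd);
}

int main() {
  int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = INADDR_ANY;
  addr.sin_port = htons(15432);  // arbitrary port for the example
  bind(listen_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(listen_fd, 64);

  while (true) {
    int client_fd = accept(listen_fd, nullptr, nullptr);
    if (client_fd < 0) break;
    // One thread per connection; all threads share the catalog, buffer
    // pool, lock tables, etc. directly, without shared-memory segments.
    std::thread(HandleConnection, client_fd).detach();
  }
  close(listen_fd);
  return 0;
}
```

A production server would of course use a bounded worker pool rather than an unbounded thread per connection, but the point is that shared state no longer has to live in explicitly mapped shared-memory segments.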
At this point we had a functioning DBMS that could convert Postgres query plans into our system's plans and then execute them on our back-end engine. Our testing, however, showed that this conversion from the Postgres world to our system was a bottleneck when we tried to increase the number of concurrent transactions. For an OLAP system, using the Postgres front-end code is fine. This is because the system will spend most of its time processing long-running queries and not in the front-end. But if you want to build a DBMS that supports highly concurrent workloads with an in-memory database, then all of this legacy code is too slow.
We therefore tried to take out the slow parts from the front-end and keep as much of the rest as possible. But the problem with this approach is that it is not easy to pull out Postgres' individual components. For example, you can't just take out the query optimizer without bringing along the catalogs. If you take the catalog, then you have to include the storage manager, which then includes the lock manager, memory manager, and so on. This is a well-known problem in database systems and was the motivation for the "RISC-style" architectures. The SIGMOD Programming Contest originally started as a way to build up a repository of DBMS components that one could glue together to make a real system. There have also been other attempts at making standalone components. For example, there are several query optimizer frameworks (Orca) and generators (Columbia, Opt++).
After a few weeks of debating our options, we concluded that it would probably be the same amount of work to integrate these components into our system as it would be to just write our own (especially since some of these components are over 20 years old). Using them would also mean that there would be large parts of the system that we did not completely understand, which would make them difficult to control when we start integrating the self-driving parts.
In 2016, we decided to bite the bullet and remove the rest of Postgres from our system. This means that we are re-writing all of the front-end components ourselves.
Research Philosophy
You may be asking what everything I've written so far has to do with research. One could argue that our journey just sounds like the growing pains of a start-up company building a DBMS. Furthermore, if I am claiming that the unique characteristic of our new DBMS is that it is autonomous, why do I care that it is not as fast as it could be with the Postgres front-end? Why bother with building (yet) another DBMS?
Mike Stonebraker taught me that the best way to have the most impact in database research is to build a system that solves real-world problems for people. One could just write papers and then hope that somebody else picks up your ideas and implements them. I think that this is a rare occurrence relative to the number of academic papers that are published. It does happen though (see Dan Abadi and Daniel Lemire). Now for Mike, the best way to do this is to build an academic prototype, publish a paper, and then form a company that builds the real system. That's a bit risky for someone without tenure, but it can be done (Dan Abadi strikes again).
The way that my research group is designing and building our DBMS is through a step-by-step process. For every part of the system that we have to build, we will try to write a research paper about our approach. My inspiration for this comes from the HyPer DBMS project from the TUM DB Group. They have a lot of papers that describe the different parts of their system and all the problems that they had to solve. But I want to go a little deeper than writing about only what we did. We will examine all of the state-of-the-art implementations in both academia and industry. We do this through a combination of reading papers and reaching out to friends in industry to ask how they implement certain things. We then perform experiments that evaluate the different approaches in Peloton under a variety of workload scenarios. Whichever one is the best is what we will use in our system.
One benefit of this style of research is that it makes us extremely unbiased. That is, unlike other papers that compare their new algorithm or method to other state-of-the-art implementations, we don't have any preference. We will take whatever just works. I will also add that we are doing more than just running existing implementations in a single platform; we also spend time to understand the algorithms carefully and come up with new techniques for overcoming bottlenecks. It does, however, mean that it takes longer for us to run the experiments because we have to keep tweaking the system.
As an example of this, several papers and articles have been published in recent years on high-performance multi-version concurrency control (MVCC). These include the protocols used in Microsoft Hekaton, HyPer, and MemSQL. The problem, however, is that these articles don't always explain the reasoning behind certain design decisions in their DBMSs. Furthermore, other aspects of these systems are highly dependent on each other but are not evaluated thoroughly. I had a visiting PhD student (Yingjun Wu) implement all of them with some MS students and then we did a bake-off. I will discuss our recently published MVCC study in my next blog post.
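These protocols differ in many details, but as rough context for that next post, the common skeleton is a per-tuple version chain with begin/end timestamps and a visibility check against the reader's timestamp. The sketch below is a generic textbook-style illustration with names of my own choosing, not how Hekaton, HyPer, MemSQL, or our system implement it.

```cpp
// Generic MVCC version-chain sketch (illustrative, not any specific system's design).
#include <atomic>
#include <cstdint>

struct TupleVersion {
  std::atomic<uint64_t> begin_ts{0};          // commit timestamp of the creating txn
  std::atomic<uint64_t> end_ts{UINT64_MAX};   // timestamp at which this version was superseded
  TupleVersion* prev = nullptr;               // pointer to the older version in the chain
  // ... the tuple's attribute data would follow here ...
};

// A transaction reading at `read_ts` sees a version only if it was created
// before the read and not yet superseded at that time.
bool IsVisible(const TupleVersion& v, uint64_t read_ts) {
  return v.begin_ts.load() <= read_ts && read_ts < v.end_ts.load();
}

// Walk the chain from newest to oldest and return the visible version, if any.
const TupleVersion* GetVisibleVersion(const TupleVersion* newest, uint64_t read_ts) {
  for (const TupleVersion* v = newest; v != nullptr; v = v->prev) {
    if (IsVisible(*v, read_ts)) {
      return v;
    }
  }
  return nullptr;
}
```

The design decisions that the papers tend to gloss over are exactly the knobs around this skeleton: how timestamps are assigned, whether chains are ordered newest-to-oldest or oldest-to-newest, where new versions are stored, and when old versions get garbage collected.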
And the reason why we are spending this effort to make this system as fast as possible is to ensure that it is robust and nimble. This will allow the self-driving "brain" to quickly try out different configurations and optimizations, and then (hopefully) immediately get feedback as to whether its decisions were correct.
Building a DBMS is Hard. Building One in Academia is Even Harder
I recognize that the above agenda sounds too idealistic and thus some of you may think that such an approach to DBMS development is only possible in academia. I concede that academia provides the freedom for one to be more methodical about designing the system than what may be possible at a start-up or a large company. But it does not necessarily mean that our system is going to be better or that our job is somehow easier. Although I don't have (direct) financial pressures, building a new DBMS is still a daunting challenge. There are other problems that I have to deal with that a company does not. I don't get to spend all my time working on the system because I have to teach, write grant proposals, serve on program committees, write papers, and teach my students to survive on the streets.
Developer Continuity
Every tech company struggles with attracting good talent (except for possibly Facebook). I don't really have that problem since the admissions process sort of takes care of that for me. My students at CMU are awesome but they are only with me for a short amount of time. For undergrads and MS students, it's usually about two semesters. New students often lack knowledge about DBMS internals (that's okay, they're here to learn); it usually takes them about a semester to pick this up either from working on the system or by taking my advanced DB class. PhD students obviously stick around for longer but they are in graduate school to do research. I have to balance their time spent writing papers versus writing new features in the system. One thing that I require is that (almost) all of the code that they write for experiments has to be merged into the DBMS's master branch after they submit the paper for review.
I will admit that I haven't quite figured out how to manage this perpetual revolving door for my team. Obviously writing a lot of tests and documentation is important, but this doesn't always happen. One way to solve this problem is to hire a full-time systems programmer, but it's hard to raise money for this as a new professor. You can't use a regular NSF research grant to pay for a programmer; instead, there are separate infrastructure grants for that.
Lack of Workloads & Testing Tools
Every DBMS company creates their own framework and infrastructure for testing their DBMS. The workloads in these test suites are collected from users over time. But when you first start building a DBMS in academia, you do not have users. Thus, the only way to get real workloads is by signing NDAs. This is the standard practice in industry (often as part of a POC). I try to avoid NDAs for both myself and my students. They are always a huge time sink because things get messy when university lawyers start talking with company lawyers. They also make it tricky for us to discuss and share our research findings with others.
If you don't have customer workloads, then the next best option is synthetic workloads. The good news here is that there are several database benchmarks available. Most of them are specific to one system (pgbench, sysbench) and contain only one workload (usually some variant of TPC-C). Our OLTP-Bench framework does include multiple workloads, but it is meant for benchmarking performance. We have not found a tool that can provide comprehensive SQL testing. SQLsmith is close to what we want, but you have to provide your own schema. There are a lot of interesting ideas in SIGMOD's now defunct DBTest workshop, but there is little to no code available.
We are working on a new SQL test suite that we will release later in the year. It will use JDBC and Postgres' SQL AST so it should be compatible with several other systems.
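Since the suite goes through JDBC it will live in Java, but to illustrate the basic differential-testing idea, here is a rough C++ sketch using libpqxx (the connection strings and port are hypothetical): run each statement from the test corpus against a reference Postgres instance and against the system under test (assuming it also speaks the Postgres wire protocol), and flag any difference in the results.

```cpp
// Rough differential-testing sketch (illustrative only; connection details are hypothetical).
#include <iostream>
#include <string>
#include <vector>

#include <pqxx/pqxx>

// Run a query and flatten its result set into one string per row.
std::vector<std::string> RunQuery(const std::string& conn_str, const std::string& sql) {
  pqxx::connection conn{conn_str};
  pqxx::work txn{conn};
  pqxx::result res = txn.exec(sql);
  std::vector<std::string> rows;
  for (const auto& row : res) {
    std::string flat;
    for (const auto& field : row) {
      flat += field.c_str();
      flat += '|';
    }
    rows.push_back(flat);
  }
  return rows;
}

int main() {
  const std::string sql = "SELECT 1 + 1";  // stand-in for a statement from the test corpus
  auto expected = RunQuery("host=localhost port=5432 dbname=test", sql);   // reference Postgres
  auto actual   = RunQuery("host=localhost port=15432 dbname=test", sql);  // system under test
  std::cout << (expected == actual ? "MATCH" : "MISMATCH") << std::endl;
  return 0;
}
```

The usual wrinkle is ordering: unless the statement has an ORDER BY, the rows have to be sorted or compared as a multiset before declaring a mismatch.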
Hardware
Getting hardware for testing and benchmarking is not a problem like it used to be in previous decades. Computers are cheap. Every student also has their own laptop. This means that we only have to provide the students with monitors and keyboards at each desk in our lab. We then have some additional server machines for benchmarking. The one lesson that I have learned since starting the project is that a large percentage of students use a Mac laptop. This means that it is important that your DBMS can build on OSX even if you only intend to deploy it on Linux. We currently don't support OSX and instead provide a Vagrant file. Students tell me that they prefer developing directly in OSX. I should have made sure that we support OSX from the beginning of the project.
If you want to get exotic hardware (e.g., Infiniband, NVM), then you either have to write a simulator (bleh) or you have to know somebody at the company to ask for a donation. DBMS developers at the major companies get access to all kinds of new hardware earlier than the average person so that they have time to make sure that their system is compatible (again, NDAs help with this).
Other Academic Database System Projects
I now want to briefly mention some of the other large-scale, multi-year DBMS projects that are going on right now in academia. The most famous academic DBMSs are Stonebraker's Ingres and Postgres projects. Mike always likes to say that you can build the first 90% of a DBMS in academia in just a few years, but then to build the remaining 10% of the system you need $5 million and several years. The next most famous is MonetDB by Peter Boncz and Martin Kersten. Peter later built the X100 prototype based on his experiences with MonetDB, and it formed the basis of VectorWise[5]. Lastly, there is the previously mentioned TUM HyPer DBMS. I consider HyPer to be one of the state-of-the-art DBMS implementations right now.
There are a couple of other academia-only DBMS projects that I am aware of right now. These are systems that are designed to support real-world applications and use cases, so I am excluding projects like Silo that are only academic prototypes. I am also excluding systems like Shark, BlinkDB, and H-Store where there are now companies actively involved in the commercial versions.
Note also that none of these projects listed below are based on Postgres. The last major academic system that I know that used Postgres was Berkeley's TelegraphCQ (which later became Truviso).
- Wisconsin Quickstep — This is an in-memory database engine by Jignesh Patel that supports hybrid data storage and fast vectorized execution. Quickstep was acquired by Pivotal in 2015 and then became an Apache incubator project in 2016.
- Stanford DeepDive — This is a DBMS for extracting structured features from unstructured or semi-structured databases. DeepDive is different than traditional systems because it is about writing queries that generate new data rather than ingesting/processing existing data. It has been used for several high-profile projects that have had significant real-world impact. Chris Re is awesome.
- Washington Myria — Myria is a distributed, shared-nothing DBMS designed to run in the cloud as a service (but it can also run on-premises). The system was originally developed by the UW DB Group to support scientists at their university. It is written from scratch entirely in Java. It uses an extensible query algebra that supports multiple data models.
- UCI AsterixDB — This is another academic project that joined the Apache Software Foundation. It is a distributed DBMS for the Hadoop ecosystem started by Mike Carey. Couchbase announced in 2015 that they were using AsterixDB's SQL++ language as the basis for their new N1QL language (think SQL that supports JSON).
- Stanford MacroBase — This is a newer system started last year by Peter Bailis (aka the "Thelonious Monk of Databases"). Like DeepDive, MacroBase is not a full DBMS but rather an execution engine for anomaly detection queries.
Footnotes
- [1] It is true that Transarc was a CMU spin-off company that had a TP monitor, but this was for their application server and it was not a full-fledged DBMS (i.e., it did not store data).
- [2] CMU has a nine-year tenure clock. They told me when I started that it's this long so that professors could pursue big ideas. I'm not sure I believe that.
- [3] At Stonebraker's 70th Birthday Party, Joe Hellerstein came up with a list of all of the Postgres derivatives. You can watch the video here (skip to 02:19:40).
- [4] I note that Facebook's MyRocks engine is looking to be a serious contender against InnoDB.
- [5] VectorWise was bought by the Ingres Corporation in 2011. It has since been renamed to "Actian Vector" and unfortunately it appears to be relegated to maintenance mode. We have benchmarked VectorWise in some experiments. It's really good.