On Naming a Database Management System

I am exhausted. More so than normal. The main reason is that my beloved wife had a baby. After many questions, the paternity test confirmed that I am the biological father. And now I am hiding out at my house in my underwear because of COVID-19.

Despite the lockdown, my research group at Carnegie Mellon is also still working on our self-driving DBMS. This system is our second attempt at building a full-featured, Postgres-compatible DBMS. In 2018, we killed off Peloton and started over with a new codebase. See my previous post on the challenges of building a DBMS in academia. We hope to announce this new system by the end of this semester.

There are two reasons why I have not talked publicly about this system. The first is that we are waiting until we get TPC-C results under a realistic operating environment [1]. The other reason is that we have struggled to come up with a good name. We do not want to reuse the name Peloton. That name is tainted. But coming up with a better one to replace it is difficult.

There is the old quip that the only two hard problems in Computer Science are cache invalidation and naming. Indeed, there is an exhaustive amount of research on solving the caching problem in databases from the last 50 years. I want to discuss the naming problem instead and why it is especially hard for databases.

Background

I have worked on two DBMSs in my career: H-Store and Peloton.

I had nothing to do with naming H-Store since the project began before I started graduate school. This was the next DBMS that Stonebraker led after the influential C-Store system. The 'C' in C-Store meant 'column,' so the 'H' in H-Store meant 'horizontal,' since H-Store was a distributed, scale-out architecture. I did raise the point that there was already the hstore extension for Postgres and that we may want to consider changing the name to avoid confusion. That never happened.

After graduating, I moved to CMU and started building a new DBMS. This time I was in charge of the project since I was a professor, so I felt that it was my job to come up with a good name. My first Ph.D. student (Joy Arulraj) and I struggled with this for several weeks. Our original placeholder name in 2015 was N-Store since it was meant to be the non-volatile memory (NVM) database[2]. But I did not want to use the same "<Letter>-Store" convention that Stonebraker used. The reason is that it is neither original nor memorable. In addition to H-Store, I have worked on S-Store and E-Store. Others have created L-Store, G-Store, 3-Store, and 4-Store. There are also brick-and-mortar shops that use the same naming convention; a clothing store in Australia that used to own h-store.com offered to sell the domain name to me a few years ago.

The more pressing concern that I had with the name N-Store beyond its uninspiring nature is that it tied us too much to a single topic (NVM). As we attracted more students to work on the system, we realized that we wanted to explore other research areas. Thus, we needed a name that would not be limited in scope. One of my collaborators (the statuesque Todd Mowry) suggested the name Peloton since that was a bicycle racing term that he was familiar with from training for triathlons.

Let me say that Peloton was an excellent name for a database system back in 2016. It was before those other people came along with their gaudy overpriced exercise bikes. At first, it was not a big deal when they arrived on the scene. The CMU lawyers said that the overlapping names would not be a problem because we were in different fields. But then the Peloton bike company raised hundreds of millions in VC funding, and all of a sudden, their stores were everywhere. My DB friends from around the country would torment me with photos of Peloton storefronts in malls and shopping districts. There was even a Peloton store outside of CockroachDB's headquarters in NYC when I visited them in summer 2019:

The Peloton store outside of CockroachDB's Manhattan office.

Given this and their gauche Stockholm Syndrome commercial from last year, we decided to ditch the Peloton name entirely when we started over with the new codebase. But again I did not want to make the same mistakes in picking a name that I made in the past.

Let's discuss what kind of names are out there and see if we can do better.

Common Database Suffixes

As of March 2020, I am aware of 700 different database systems in my encyclopedia of databases. One of the things that I have noticed from this list is the lack of imagination in names. Many projects use the same convention of "<Word><Suffix>" where the suffix is some term to indicate that it is a database. For example, there are many databases that use "DB" as their suffix (e.g., NutsDB, ArangoDB, RocksDB).

Here are the five most commonly used terms in database names:

Suffix    Number of Databases
DB        255
SQL       51
Base      41
Graph     16
Store     14

As you can see, most systems go with the "DB" suffix, which makes sense. This is equivalent to operating systems using the suffix "OS" (e.g., TockOS, TempleOS, ReactOS) or file systems using the suffix "FS" (e.g., Btrfs, NTFS, OJSimpsonFS).
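A tally like the one above can be approximated with a few lines of Python. This is only a minimal sketch: the function names (`split_suffix`, `count_suffixes`), the case-insensitive "ends with" matching rule, and the tiny sample list are all assumptions made for illustration, not the method or data behind the numbers in the table.

```python
from collections import Counter

# The five suffixes from the table above.
SUFFIXES = ("DB", "SQL", "Base", "Graph", "Store")


def split_suffix(name, suffixes=SUFFIXES):
    """Return (prefix, suffix) if name ends with a known suffix, else (name, None).

    Matching is case-insensitive (an assumption), so "Sybase" counts as "Base".
    """
    lowered = name.lower()
    # Try longer suffixes first so the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Require a non-empty prefix: a system named just "DB" has no suffix.
        if lowered.endswith(suffix.lower()) and len(name) > len(suffix):
            return name[: -len(suffix)], suffix
    return name, None


def count_suffixes(names, suffixes=SUFFIXES):
    """Count how many names end with each known suffix."""
    counts = Counter()
    for name in names:
        _, suffix = split_suffix(name, suffixes)
        if suffix is not None:
            counts[suffix] += 1
    return counts


# Hypothetical sample; a real run would use the full list of system names.
sample = ["NutsDB", "ArangoDB", "RocksDB", "MySQL", "Sybase",
          "MemGraph", "H-Store", "Postgres"]
print(count_suffixes(sample).most_common())
```

The same `split_suffix` helper also returns what is left once the suffix is dropped (e.g., "Mongo" for "MongoDB"), which is the part of the name that matters for the search problem discussed next.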

The standard formula for using a database suffix in a name is to pick one or two words that are emblematic of the main idea of the system as the prefix, then slap the suffix at the end of it. Some examples of this include how TabDB stores a database in browser tabs, BigchainDB is a blockchain database, IoTDB stores IoT data, and hipsterDB only supports data that is not "mainstream".

People will inevitably refer to your DBMS without the suffix. That is, instead of saying MongoDB, they will shorten the name to "Mongo." This idiom will occur unless the suffix is integral to the name. I have never heard anybody shorten MySQL or Sybase to "My" or "Sy," respectively. It is, therefore, important that if you go with the suffix approach in naming your DBMS, then the first words must not be commonly used for something else. That way, if people search for your DBMS without the suffix, then they will still be able to find it. For example, instead of Freedman et al. calling their system "TimeSeriesDB", they called it TimeScaleDB. This name was a smart move because if you search for "timescale", then you will get their database.

If you use a common word in front of the suffix (e.g., BoltDB, PumpkinDB, ClearDB), then it is unlikely that your DBMS will be the first result for just that word. The one exception to this rule is Spark, but they are an outlier.

But I feel like the suffix approach is played out. This is not what I am looking for in a database name.

Systems With Similar Names

In addition to using the same suffix, there are several DBMSs that have names that are too similar to each other. Here is a list of the ones that I am aware of:

The main issue with having a similar name with another DBMS is with search engine rankings. Some of these systems are small academic or hobby projects where having a resembling name does not matter that much. But many of these are VC-backed companies that need to attract users and customers.

Aside from potential confusion, a similar name diminishes the "coolness" factor of a DBMS. Avoiding this overlap is essential if you are marketing your DBMS to developers to promote greenfield adoption, as opposed to selling your DBMS to replace legacy systems in the enterprise market. With the former strategy, you need to understand that you are not just selling your database software, but that you are also selling a database lifestyle. What database somebody uses for their application says a lot about that person's values, political views, and marriage prospects.

Some systems have similar names to indicate their lineage or inspiration. For example, there are Redis clones that use the phrase "dis" in their names to indicate that they are compatible with Redis (e.g., Edis, LedisDB, Pydis).

What should you do if your DBMS's name turns out to be too similar to somebody else's? The good news is that history indicates that if you change names early enough in the life of the project, then the switch does not negatively affect the system's adoption. There are several examples of this. The original name of Stonebraker's commercial version of Postgres in the 1980s was Miro. They then switched to Montage due to trademark disputes over this first name. But then there was another dispute and the final name was Illustra [3]. Stonebraker used this approach of adding an 'A' to the end of a word again when naming Vertica. The first name of the commercial version of H-Store was also Horizontica, but it was changed to VoltDB (i.e., the Vertica On-Line Transaction Database).

More recent DBMS name change examples include how Aerospike used to be called Citrusleaf, but then it was considered to be too similar to Citus. NuoDB used to be called NimbusDB (which is a name that I liked), but they switched to avoid a trademark dispute with another non-database cloud computing vendor. MapD was renamed to OmniSci after MapR started barking at them.

The Pavlo Database Naming Method

After years of careful study of this problem, I believe that I have devised a method for successfully naming a DBMS if one does not want to use the suffix approach. To understand what this strategy is, we first need to look at some examples. In my opinion, the best DBMS names from the last thirty years are Postgres[4] and Clickhouse.

These two names did not mean anything before, but now they only have one connotation. There is no ambiguity. There is no overlap with other DBMS names or non-database entities. They are easy to spell correctly[5]. Everyone refers to them by their full name. Nobody mistakenly calls them "PostgresDB" or "ClickhouseSQL."

After reflecting on what makes them so good, I realized what the secret was to them. They are a two-syllable name that is derived from combining two unrelated one-syllable words together (e.g., Post + Gres, Click + House). Each individual word has its own meaning. Only when you put them together do they mean the database.

Given this, I henceforth contend that the best way to name a DBMS is to combine two one-syllable words. This naming strategy is now known as the Pavlo Database Naming Method™.

Honorable Mentions

There are several DBMSs that follow my naming method that are worth mentioning: Dydra, GridGain, QuickStep, and Xeround (RIP). Redis is also a great name but it does not follow my strategy exactly. Yes, it is a two-syllable name that is unique. But rather than being words put together, it is an abbreviation (Remote Dictionary Server). Citus and Hadapt are similar to Redis, but I think they are too prone to misspellings.

These are good examples of names that I like, but they are just not as lit as Postgres or Clickhouse.

Caveats

The naming method only works if the combined word does not already mean something else. For example, Lovefield is both a database and an airport in Dallas established in 1917. The GemFire database project started in the late 2000s, but there is also a video game from 1991 with the same name. One could argue that Postgres is too similar to Ingres. But none of my students have ever heard of Ingres before I told them about it.

Using a color in the name technically follows my convention (e.g., Redshift, Greenplum, BlueFlood, BlackRay), but I am less enthusiastic about them. The names overlap with too many other things (e.g., there are edible fruit plums that are green).

Note also that I am approaching this naming problem from an academic standpoint. I want to recruit students to work in my research group. I want people to read my publications. I do not need to worry about customers because I am not up for tenure until 2022. Thus, my naming method might not apply to commercial systems. If you are a business, sometimes being direct with your name is a good idea. MemSQL is a good example. The name tells you everything you need to know: they are an in-memory DBMS that supports SQL. Similarly, MemGraph is another satisfactory and safe name for a commercial DBMS.

Applying the Pavlo Method

We set out to put our method into practice. As I mentioned above, we started over with a new DBMS, so we wanted to have a new name. The first name that we came up with was PoopDish. This name is obviously hilarious, and thus it was the original name of our system for a brief period. After sobering up, we realized that this (might) be a bad idea. Thus, we had to abandon the name. The only trace of this first name is this commit that switches the DBMS's name to be temporarily my dog's name:

The infamous "PoopDish" revert commit.

Rather than everyone coming up with random names on their own, my plucky student (Wan Shen Lim) then wrote a script that would generate a new potential DBMS name by combining two one-syllable words. The script then posted a new name each morning on Slack to our research group.
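For the curious, a generator of this kind can be sketched in a few lines of Python. To be clear, this is not Wan's script: the word list, the function names (`generate_name`, `daily_candidates`), and the seeding are hypothetical choices made for illustration, and the Slack posting step is left out entirely.

```python
import random

# Hypothetical vocabulary of one-syllable words; the real script's word
# list is not described in this post.
WORDS = ["bus", "tub", "post", "click", "house", "grid", "gain",
         "quick", "step", "frog", "lamp", "dish"]


def generate_name(words=WORDS, rng=None):
    """Combine two distinct one-syllable words into a CamelCase name."""
    if rng is None:
        rng = random.Random()
    # sample() picks two different entries, so "BusBus" cannot occur.
    first, second = rng.sample(words, 2)
    return first.capitalize() + second.capitalize()


def daily_candidates(count, seed=None, words=WORDS):
    """Return `count` candidate names; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)
    return [generate_name(words, rng) for _ in range(count)]


# Print a handful of candidates; the output varies with the seed.
print(daily_candidates(5, seed=2020))
```

A sketch like this only handles the mechanical half of the method. Checking that the combined word does not already mean something else still takes a human (or at least a search engine).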

One of the first names that Wan's script generated that we liked was BusTub. This name is beautiful and satisfies my above criteria. It is unique. It is whimsical. And most importantly, it clearly means "databases." We liked the name so much that we used it for the educational DBMS that we employ in CMU's introduction database course. We even hired a professional designer to make us a logo that looks like CMU's campus shuttles:

Carnegie Mellon's BusTub Database Management System.

But BusTub is not replacing Peloton. It serves a different purpose. We do have a name for the new DBMS. We have published some minor papers that do mention it, but we are not ready to announce it yet. More on that later...

Footnotes

  1. By "realistic" environment, I mean (1) executing client-side transactions that invoke queries over JDBC (i.e., no stored procedures) and (2) running the DBMS with write-ahead logging writing records to an SSD (i.e., no ramdisk or mythical PM hardware). Many academic papers make these assumptions.
  2. Joy won the 2019 SIGMOD Best Dissertation Award for his work on NVM databases.
  3. See this tribute video put together by Gary Morgenthaler that discusses the history of Illustra from Stonebraker's 70th Birthday Celebration.
  4. I acknowledge that the official name is PostgreSQL. The original name was Postgres as this was the second DBMS that Stonebraker developed after INGRES. So it was "Post" + "Ingres" → Postgres.
  5. I occasionally watch DBDB.io's autocomplete log to see what people search for on the site. I consistently see everyone spelling Postgres, Clickhouse, and Redis correctly on the first try. This is not a scientific observation, but my guess is that YugaByte is misspelled the most often (e.g., "HugaByte", "JugaByte"). I also see several searches for "SQL Light".