Like a shot to your dome piece, I'm back to hit you with my annual roundup of what happened in the rumble-tumble game of databases. Yes, I used to write this article on the OtterTune blog, but the company is dead (RIP). I'm doing this joint on my professor blog.
There is much to cover from the past year: 10-figure acquisitions, vendors running wild in the streets with license changes, and the most famous database octogenarian splashing cash to recruit a college quarterback to impress his new dimepiece.
I promised my first wife that I would write more professionally this year. I have also been informed that some universities assign my annual blog articles as required reading in their database courses. Let's see how it goes.
Previous entries:
- Databases in 2023: A Year in Review
- Databases in 2022: A Year in Review
- Databases in 2021: A Year in Review
It’s My Database and I’m Going to License it the Way I Want!
We live in the golden era of databases. There are many excellent (relational) choices for all types of application domains. Many are open-source despite being built by for-profit companies backed by VC money.
But VCs want that money back and their trap full, so these companies turn out a hosted service for their DBMSs on the cloud. But the cloud makes open-source DBMSs a tricky business. If a system becomes too popular, then a cloud vendor (like Amazon) slaps it up as a service and makes more money than the company paying for the development of the software. This threat is why many database companies change to more restrictive source code licenses that prevent cloud vendors from reselling their products. MongoDB was among the first to do this in 2018 when they switched to their Server Side Public License (SSPL).
This past year was a turbulent one for license changes, and the most prominent two were Redis™ and Elasticsearch.
Redis:
Redis Ltd. (the company) is on an aggressive path towards its IPO. Originally starting as Redis Labs in 2011, they switched their name to Redis Ltd. in 2021 when they acquired the Redis trademark from its creator (Salvatore Sanfilippo), whom Redis Labs had bankrolled. Over the last few years, Redis Ltd. has attempted to consolidate control over the Redis landscape. The company has also tried to cast off the perception that the system is primarily used as an in-memory cache by adding support for vectors and other data models.
In March 2024, Redis Ltd. announced they were switching from the system's original (very permissive) BSD-3 license to a dual license comprising the proprietary Redis Source Available License and MongoDB's SSPL. The company announced this change the same day they announced the acquisition of Speedb, an open-source fork of RocksDB.
The backlash to the Redis license move was quick. The same week the license changed, two forks were announced based on the original BSD-3 code line: Valkey and Redict. Valkey started at Amazon, but engineers at Google and Oracle quickly joined. In only one week, the Valkey project clapped back at Redis Ltd. when it became part of the Linux Foundation, and several major companies shifted their development efforts to it. Redis Ltd. did not help their perception of being up to no good when they got frisky with their beloved trademark and started taking over open-source Redis extensions.
In an obvious homage to when Bushwick Bill (RIP), Scarface, and Willie D got back together in 2015, Redis' creator announced in December 2024 that he is in touch with the Redis Ltd. management and is looking to make a comeback to reunite the Redis community.
Elasticsearch:
Elastic N.V. is the for-profit company backing the development of Elasticsearch, the leading text-search DBMS. In 2021, they announced their switch to a dual-license model of the Elastic License and MongoDB's SSPL. Again, this was in response to the rising prominence of Amazon's Elasticsearch offering, even though the service had been out since 2015. Amazon didn't take kindly to this switch and announced their OpenSearch fork.
Three years later, Elastic N.V. announced in August 2024 that they reverted their license change and switched to the AGPL. Their blog article announcing this change references Kendrick Lamar's songs (e.g., Not Like Us). Amazon did not like being called the Drake of Databases and announced the following month that they were transferring ownership of the OpenSearch project to the Linux Foundation.
Andy’s Take:
This turmoil seems like a lot just over licenses, but remember, there is big money in databases. And this is just two systems! I didn't even discuss Greenplum quietly killing off their open-source repository after nine years and going proprietary. But people did not notice because nobody willingly runs Greenplum anymore. The only DBMS I know of that made the same open-source reversal is Altibase in 2023.
I'll be blunt: I don't care for Redis. It is slow, it has fake transactions, and its query syntax is a freakshow. Our experiments at CMU found Dragonfly to have much more impressive performance numbers (even with a single CPU core). In my database course, I use the Redis query language as an example of what not to do. Nevertheless, I am sympathetic to Redis Ltd.'s plight of being overrun by Amazon. However, the company is overestimating the barrier of entry to build a simplistic system like Redis; it is much lower than building a full-featured DBMS (e.g., Postgres), so there are several alternatives to the OG Redis. They are not in a position of strength where such posturing will be tolerated by the community.
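To make concrete why I pick on the Redis query language in class, here is a sketch of the bookkeeping it pushes onto the application. It uses plain Python dicts and sets (with made-up keys) to stand in for Redis hashes (HSET) and sets (SADD); since there is no declarative query layer, the application must build and maintain its own secondary indexes to answer even a simple predicate:

```python
# Toy emulation of the Redis data model: dicts stand in for hashes,
# Python sets stand in for Redis sets. Key names are made up.
hashes = {}   # emulates: HSET user:1 name "alice" city "pgh"
sets = {}     # emulates: SADD city:pgh user:1

def hset(key, mapping):
    hashes.setdefault(key, {}).update(mapping)

def sadd(key, member):
    sets.setdefault(key, set()).add(member)

def add_user(user_id, name, city):
    key = f"user:{user_id}"
    hset(key, {"name": name, "city": city})
    sadd(f"city:{city}", key)   # manual secondary index the app must maintain

def users_in_city(city):
    # SQL version: SELECT name FROM users WHERE city = ? -- one query.
    # Key-value version: read the hand-built index, then fetch each hash.
    return sorted(hashes[k]["name"] for k in sets.get(f"city:{city}", ()))

add_user(1, "alice", "pgh")
add_user(2, "bob", "pgh")
add_user(3, "carol", "nyc")
print(users_in_city("pgh"))   # ['alice', 'bob']
```

Every update path in the application now has to remember to keep that index in sync, which is exactly the kind of bug-prone work a DBMS query planner is supposed to do for you.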
The Elasticsearch saga is the same story as Redis, except they are further along in the plot: the company announces a license change, competitors create an open-source fork, and the company reverts to an open-source license to muted fanfare.
Notice that Redis and Elasticsearch are receiving more backlash compared to other systems that made similar moves. There was no major effort to fork off MongoDB, Neo4j, Kafka, or CockroachDB when they announced their license changes. CockroachDB even changed its license again in 2024 to make larger enterprises start paying up. It cannot simply be that the Redis and Elasticsearch install bases are so much larger than these other systems' that more people were upset by the change; the number of MongoDB and Kafka installations was just as large when they switched their licenses. In the case of Redis, I can only think that people perceive Redis Ltd. as unfairly profiting off others' work since the company's founders were not the system's original creators. An analysis of Redis' source code repository also shows that a sizable percentage of contributions to the DBMS comes from outside the company (e.g., Tencent, Alibaba). This "stolen valor" was the reason for the ire HashiCorp received when they changed Terraform's license in 2023.
The overarching issue in these license shuffles is the long-term viability of open-source independent software vendors (ISVs) in the database market. The cloud vendors are behemoths with infinite money. If an open-source DBMS takes off, they will start hosting it and make more money than the ISV. Or they will add your DBMS's wire protocol as a front-end to an existing DBMS, like when AWS added InfluxDB v2 protocol support for their Timestream DBMS in March 2024. They can then be the girlfriend of the aforementioned Bushwick Bill and shoot you in the eye, like when AWS announced that their new Valkey-compatible services are 30% cheaper than their Redis-compatible services.
The Databricks vs. Snowflake Gangwar Continues
There continues to be no love lost between Databricks and Snowflake. This fight is a classic database war that has spilled out into the streets. The two companies’ previous feud over query performance has expanded to other areas of data management and become much more expensive.
Databricks took the first shot in March 2024 when they announced they spent $10 million to build their DBRX open-source LLM with 132 billion parameters. The DBRX model was developed by the Mosaic team, which Databricks acquired in 2023 for $1.3 billion. One month later, Snowflake rolled up to the same corner and lit it up using their Arctic open-source LLM with 480 billion parameters. Snowflake boasted they only spent $2 million to train their model while outperforming DBRX for “enterprise” tasks like SQL generation. You can tell that Snowflake cares most about taking a shot at Databricks because their announcement shows other LLMs doing better than them (e.g., Llama3), but they highlight how they are better than DBRX. One AI researcher was confused about why Snowflake focused so much on DBRX in their analysis and not the other models; this person does not know how much blood these two database rivals have spilled.
While the public LLM battle raged, another front in the war between Databricks and Snowflake opened up behind the scenes over catalogs. For most of the 2010s, Hive’s HCatalog has been the de facto catalog system on data lakes. Iceberg and Hudi emerged as replacements in the late 2010s from Netflix and Uber, respectively, and both became top-level Apache projects backed by VC-funded startups. These systems provide a metadata service to track files and support transactional ingestion of new data on an object store (e.g., S3). Databricks has a proprietary catalog service called Unity, which works with its DeltaLake platform. Snowflake announced their initial integration with Iceberg-backed tables in 2022. They then expanded their support for Iceberg over the next few years. Then, they looked into acquiring Tabular, the main company behind Iceberg, to compete against Databricks with Unity and DeltaLake. The story goes that Snowflake was about to close the deal for $600 million. But then Databricks crashed the party and splashed $2 billion to acquire Tabular. Databricks announced the acquisition the same day as the Snowflake CEO’s conference keynote address, where he was announcing their new open-source Polaris catalog service in June 2024. Databricks continued to kick Snowflake in the teeth when they announced they were open-sourcing their Unity catalog the following week. Straight murdergram.
Andy’s Take:
What is interesting about this database battle is that it is not just about raw performance numbers. It isn’t like the old Oracle versus Informix shootout of the 1990s, where they were mostly boasting about faster query latencies. It is true that the battle also went beyond just benchmark shots when Informix sued Oracle (and later had to withdraw their suit) because Oracle poached some top Informix executives. Later, the world found out that the Informix CEO had cooked the company’s books to inflate revenue numbers to look better against Oracle and had to do a two-month bid in the federal clink.
Instead, the Snowflake versus Databricks battle has expanded to be about the ecosystem around the database. That is, it is about the infrastructure people use to get their data into a database and then the tools they use on that data. Vectorized execution engines for analytical queries are a commodity now. Databricks and every other OLAP vendor follow Snowflake’s architecture design from 2013, originally based on one of the Snowflake co-founders’ Ph.D. thesis. What matters now are quality-of-life facets (which are hard to monetize and compare with competitors), compatibility with other tools, and AI/LLM magic.
At least the competition between Snowflake and Databricks has an upside for consumers. Such ferocity means better products and technology for data (e.g., Snowflake’s Polaris is now an Apache project) and eventually (hopefully) lower prices. It’s not like the previous pissing match between the Oracle and Salesforce CEOs, where it was two rich guys taking potshots at each other during their expensive conferences.
Shoving Ducks Into Everything
In the same way that Postgres is the default choice for anyone starting a new operational database, DuckDB has entered the zeitgeist as the default choice for someone wanting to run analytical queries on their data, a crown previously held by Pandas. Given DuckDB's insane portability, there are several efforts to stick it inside existing DBMSs that do not have great support for OLAP workloads. This year, we saw the release of four different extensions to stick DuckDB up inside Postgres.
The first announcement came in May 2024 when Crunchy Data revealed their proprietary bridge for rewiring Postgres to route OLAP queries to DuckDB. They later announced an expanded version of their extension to leverage DuckDB's geospatial capabilities to accelerate PostGIS queries.
In June 2024, ParadeDB announced their open-source extension (pg_analytics) that uses Postgres' foreign data wrapper API to call into DuckDB; they previously were using DataFusion in an earlier version (pg_lakehouse) but switched to the Duck.
Then, in August 2024, the next DuckDB-for-Postgres extension (pg_duckdb) came out. The source code for this extension is hosted under the DuckDB Labs GitHub organization. As such, this is the officially sanctioned DuckDB extension for Postgres. The original announcement touted this project as being a collaboration between MotherDuck, Hydra, Microsoft, and Neon. The latter two were (allegedly) kicked out of the mix over a dispute on development controls, similar to Arabian Prince leaving NWA. The repository now only lists it as a joint effort between MotherDuck and Hydra.
The latest DuckDB extension dropped in November 2024 with pg_mooncake. Mooncake differs from the other three because it supports writing data through Postgres into Iceberg tables with full transaction support.
Andy’s Take:
Most OLAP queries do not access that much data. Fivetran analyzed traces from Snowflake and Redshift and showed that the median amount of data scanned by queries is only 100 MB. Such a small amount of data means a single DuckDB instance is enough to handle most queries.
DuckDB's convenience and portability are the reasons for its proliferation in the Postgres community. Although ClickHouse has existed since 2016, it was not as easy to run as DuckDB until recently (see this blog article that discusses the steps to deploy ClickHouse in 2018). These DuckDB extensions are a single entry point to the broader data ecosystem. Users no longer need one extension to access data in Iceberg and another for S3. DuckDB can handle all of that for you. It allows organizations to gain high-performance analytics without needing an expensive data warehouse.
Postgres' support for extensions and plugins is impressive. One of the original design goals of Postgres from the 1980s was to be extensible. The intention was to easily support new access methods and new data types and operations on those data types (i.e., object-relational). Postgres has also provided a "hook" API since 2006 that lets extensions intercept and override the DBMS's internal behavior. Our research shows that Postgres has the most expansive and diverse extension ecosystem compared to every other DBMS. We also found that the DBMS's lack of guard rails means that extensions can interfere with each other and cause incorrect behavior.
Earlier projects that added columnar storage to Postgres (e.g., Citus, Timescale) only solved part of the problem. Columnar data formats make retrieving data from storage more efficient. However, a DBMS cannot fully exploit those formats if it still uses a row-oriented query processing model (e.g., Postgres). Using DuckDB provides both columnar storage and vectorized query processing.
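The distinction between the two processing models can be sketched in a few lines of toy Python. This is an illustration only, not actual Postgres or DuckDB internals: a tuple-at-a-time (Volcano-style) engine pays interpretation overhead on every row, while a vectorized engine amortizes that overhead across a whole batch of column values:

```python
# Toy contrast of the two query processing models for the query
# "SELECT SUM(price) WHERE qty > 2". Shows where per-row overhead lives;
# real engines operate on fixed-size vectors (e.g., 2048 values) and SIMD.
prices = [9.5, 3.0, 12.0, 7.5]   # columnar storage: one list per column
qtys   = [1,   5,   3,    2  ]

def row_at_a_time():
    # Volcano-style: fetch one tuple, evaluate the predicate, repeat.
    total = 0.0
    for i in range(len(prices)):     # per-row iterator call
        if qtys[i] > 2:              # predicate interpreted per row
            total += prices[i]
    return total

def vectorized():
    # Operate on whole column batches; interpretation cost is paid once
    # per batch instead of once per tuple.
    mask = [q > 2 for q in qtys]                       # one pass: filter
    return sum(p for p, m in zip(prices, mask) if m)   # one pass: sum

print(row_at_a_time(), vectorized())   # 15.0 15.0
```

With row-oriented storage, the engine cannot even form those column batches cheaply, which is why bolting a columnar format onto a tuple-at-a-time executor leaves performance on the table.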
There is likely a turducken joke here involving an elephant, but I will not make it because I do not want to get fired or put on probation by the university (again).
Random Happenings
Many one-off events happened with databases last year that you might have overlooked. Here is a quick summary of them:
Releases:
- Amazon Aurora DSQL
There isn't much public information yet about how AWS implemented their new "Spanner-like" DBMS (see Marc Brooker's discussion about the DBMS architecture). The key ideas are a distributed log service (rumors were it was going to be based on now-defunct QLDB) and timestamp ordering via Time Sync. But this announcement shows you how much brand recognition the name "Aurora" carries in the database world because AWS used it for this new DBMS that seemingly shares no code with their flagship Aurora Postgres RDS offering.
- CedarDB
Umbra is one of the most advanced DBMSs ever built, written by the world's greatest database systems researcher (Thomas Neumann). But Thomas is content with staying at his university to work on Umbra, remaining comfortably on top of the Clickbench leaderboard, and not worrying about pesky customers. That's why his top Ph.D. students forked his code and are commercializing it as CedarDB.
- Google Bigtable
The only interesting part of this announcement is that the former vanguard of the NoSQL movement in the late 2000s now supports SQL in 2024.
- Limbo
Turso has been working on the libSQL fork of SQLite for a while, but they went all out in 2024 by announcing a complete rewrite of SQLite in Rust. In their announcement, they correctly point out that the value of SQLite is not just from its code, but also from the insane test engineering that ensures it runs correctly everywhere. That is why the Limbo developers are working with a deterministic testing startup by ex-FoundationDB people. See FoundationDB's 2020 CMU-DB talk for more information on what this testing means.
- Microsoft Garnet
This key-value store is the successor to the impressive FASTER system from Microsoft Research. It is compatible with Redis and supports inter-query parallelism, larger-than-memory databases, and real transactions. Redis should not be anybody's first choice these days.
- MySQL v9
Six years after MySQL v8 went GA, the team turned v9 out on the streets. But people quickly found that it crashed if your database had more than 8000 tables. I am underwhelmed with the feature list in this new major version. Oracle is putting all its time and energy into its proprietary MySQL Heatwave service. MySQL is still widely used, but the excitement is not there anymore. Everyone has moved on to Postgres.
- Prometheus v3
It has been seven years since the last major version of Prometheus. There are so many compatible alternatives now that the OG Prometheus may not be the best option for some organizations.
Acquisitions:
- Alteryx → Private Equity
I've never met anybody who uses Alteryx, and I don't have an opinion about them.
- MariaDB → Private Equity
Hopefully the PE people buying the MariaDB Corporation can clean up the mess. See my analysis from last year about the MariaDB dumpster fire.
- OrioleDB → Supabase
This purchase makes sense if you are one of the leading Postgres ISVs. Postgres has a great front-end but an outdated storage architecture. OrioleDB fixes that problem.
- PeerDB → ClickHouse
Better ETL tooling to get data out of Postgres and into ClickHouse. This is a smart move by ClickHouse, Inc.
- PopSQL → Timescale
They bought themselves a fancy SQL editor UI. It is a quality-of-life improvement.
- Speedb → Redis Ltd.
See the discussion above. They are likely going to use Speedb to allow Redis to spill data to disk. Speedb's developers never explained what changes and improvements they made in their RocksDB fork (or I could not find it?). See Mark Callaghan's recent comparison of Speedb vs. RocksDB.
- Rockset → OpenAI
This is big news for the company, but unfortunately they had to shut down the DBaaS in September 2024. Rockset had a great engineering team with some of the best database engineers from Facebook. I just never liked how their DBMS stored three copies of your data in its indexes.
- Tabular → Databricks
Again, see the discussion above. Iceberg is the standard (sorry Hudi); even Amazon S3 now supports it. It remains to be seen how the adoption of Polaris will evolve and whether they will be able to maintain compatibility in the long term.
- Verta.ai → Cloudera
I guess Cloudera is still alive?
- Warpstream → Confluent
Rewriting Kafka in golang but then making it spill to S3. I'm happy for the Warpstream team, but Confluent could have done this themselves.
Funding:
- Databricks - $10 billion Series J
- DBOS - $8.5 million Seed Round
- LanceDB - $8 million Seed Round
- SDF - $9 million Seed Round
- SpiceDB - $12 million Series A
- TigerBeetle - $24 million Series A
There are a few more raises from CedarDB, SpiralDB, and others but those amounts are not public yet.
Deaths:
- Amazon QLDB
If Amazon can't figure out how to make money on a blockchain database, then nobody can. And yes, I know QLDB is not a true P2P blockchain, but it's close enough.
- OtterTune
Dana, Bohan, and I worked on this research project and startup for almost a decade. And now it is dead. I am disappointed at how a particular company treated us at the end, so they are forever banned from recruiting CMU-DB students. They know who they are and what they did.
I want to also give special props to Andres Freund for discovering the xz backdoor in 2024 while working on Postgres at Microsoft. This attack was a two-year campaign to inject malicious code into an important compression library widely used in computing. Although the backdoor targeted SSH and not Postgres directly, it is another example of why database engineers are some of the best programmers in the world.
Andy’s Take:
Databricks has blown away all other fundraising in the world of databases for the second year in a row with a disgustingly brash $10 billion Series J round. This is after their $500 million Series I in 2023 and $1.6 billion Series H in 2021. What is different about this time is this funding was for buying stock from employees who were getting impatient about Databricks' inevitable IPO. CMU-DB has several alumni at Databricks, including a former #1 ranked Ph.D. student. I know some of them are anxiously awaiting the Databricks IPO before deciding what to do next.
The upcoming year is going to be a test of strength for many database startups. Nobody wants to be the next MariaDB Corporation, and thus several are waiting to ride Databricks' wake before going IPO themselves. Declining interest rates in the upcoming year may open up additional funding for several database companies that raised large rounds more than two years ago (e.g., CockroachDB, Starburst, Imply, DataStax, SingleStore, Firebolt). The one standout from this crowd is dbt Labs, which I have heard is comfortably crushing it.
See also the Database of Databases list of new DBMSs released in 2024.
Can’t Stop, Won’t Stop
Do you know who had their 80th birthday this year? The legendary Larry Ellison! Once again, we see that he is a man who refuses to settle down or be put into a box. First, Larry propelled himself up Forbes' Billionaires List to become the third richest person in the world. In March 2024, the Oracle stock rose so much that he made $15 billion in a single day. Flush with cash, Larry went shopping in July 2024 and signed a $6 billion deal to purchase Paramount Studios for his only son (from his third wife). He then decided to relax by buying a Palm Beach resort for only $277 million. These moves happened in just one year, and databases paid for all of it. But these are mere trifles compared to Larry's most significant accomplishment in 2024.
Everyone I know was surprised when our Larry Ellison news alerts woke us up in the middle of the night in November 2024. The headlines were touting how Larry helped the University of Michigan football program recruit the premier college quarterback. The university had previously announced that this player was transferring from Louisiana State to Michigan. Their press release included a curious acknowledgement to "Larry and his wife Jolin" for helping with the recruiting effort. Reporters soon confirmed that this "Larry" was the one and only Larry Ellison! Larry contributed $12 million to the booster campaign to bankroll the best quarterback's move to Michigan.
The bigger mystery in this story was the identity of this "Jolin" person. Investigators found older photos of Larry watching a tennis match with a woman wearing a Michigan hat. Then, two weeks later, a major news organization broke the story at 5:30 am (my alerts woke me up again) that the woman's identity was Jolin (Keren) Zhu and they confirmed that she was Larry's new wife.
Andy’s Take:
I am beaming with pride over what Larry accomplished in the past year. He famously did not graduate from any university and had no prior connection to the University of Michigan. And yet, because the love of his life went to Michigan about a decade ago, Larry made magic happen by writing a check for a measly $12 million (about 0.0055% of his net worth). I told Larry this especially means a lot to me because my former #1 ranked Ph.D. student is now a professor in Michigan's Computer Science department with their famous Database Group.
What is even more fantastic about this story is that Larry is again in love! Too many people struggle in today's world to find that special somebody. Dating apps are a mess, speed dating events are awkward, and it is now considered uncouth to hang around a playground to meet single parents when you do not have kids of your own. Then, just when you think you finally found the right person, it all falls apart when you learn they do not wash their socks regularly or like to put hot sauce on cold cereal. That is why everyone was telling me that Larry would never get married again after his 2010 divorce from romance novelist Melanie Craft (fourth wife). Those people were telling me the same thing after his 2020 divorce from Nikita Kahn (fifth wife). But I knew better, and Larry proved me right with his surreptitious marriage to Keren Zhu (sixth wife)!
Conclusion
I was planning on starting this article boasting how this is the first time in three years where I was celebrating NYE not sick. But then my biological daughter gave me COVID so I'm laid up with that. I got boosted back in September and they gave me Paxlovid, so I'll survive this.
I am disappointed that OtterTune is dead. But I learned a lot and got to work with many brilliant people. I am a big fan of Intel Capital and Race Capital for sticking with us to the end. I hope to announce our next start-up soon (hint: it’s about databases).
In the meantime, I am happy to be back full-time at Carnegie Mellon University. Jignesh Patel and I have some baller research projects that we hope to turn out this upcoming year. I am also looking forward to teaching a new course on query optimization this semester. I need to figure out how to juice my stats because in September 2024, Wikipedia removed the article about me over not having enough citations.
We are staying true to DJ Mooshoo while he is locked up in Cook County. We hope to free him in 2025.
Lastly, I want to give a shout-out to ByteBase for their article Database Tools in 2024: A Year in Review. In previous years, they emailed me asking for permission to translate my end-of-year database articles into Chinese for their blog. This year, they could not wait for me to finish writing this one, so they jocked my flow and wrote their own off-brand article with the same title and premise.