Discovering Patterns and Relationships among Investors, Startups and Beyond
In the past decade, information technology has spawned a great number of companies with wonderful products that make our lives more entertaining and convenient. Naturally, it is worthwhile to analyze data from the startup arena to find interesting patterns, such as which companies are more likely to go public or get acquired, which domains yield more successful products or companies in a particular time frame, who tends to invest in which companies, and so on. We have done some research along those lines in the recent past, and we would like to share our preliminary thoughts with you in this document.
TechCrunch released a public CrunchBase corpus in a downloadable form on June 6, 2013, and you may want to explore their data set as well. However, this article is still useful in that it gives you a sense of our ideas and what we have done.
Corpus | Download | Acquisition Prediction | More Ideas |
A large corpus with high-quality data is an essential step toward any data mining task, and to that end, we used the public data set provided by CrunchBase in our research. This is a great data source with the following benefits:
Corpus Statistics and Links
To better illustrate our research results and help you get started in this area quickly (if you are interested, of course), we share within Carnegie Mellon the corpus that we collected from CrunchBase and TechCrunch in mid-2012. If you do not have a CMU IP address and yet want to quickly play with the data for research purposes only, send me an email.
Description | Source | Size | Link |
Company profiles | CrunchBase | 101,049 | |
Person profiles | CrunchBase | 133,394 | |
Financial organization profiles | CrunchBase | 8,449 | |
Product profiles | CrunchBase | 21,761 | |
Service provider profiles | CrunchBase | 4,766 | |
News articles | TechCrunch | 8,689 (#companies) | |
News article URLs | TechCrunch | 8,605 (#companies) | |
Data Format and Scripts
CrunchBase assigns a unique ID called a permalink to each entity. For example, the permalink for Accel Partners is accel-partners. In the CrunchBase corpus we provide above, all file names are permalinks. In the package with news article URLs, each file is named by a permalink and contains, on each line, a date followed by the URL of the corresponding TechCrunch article. The data set with TechCrunch articles is structured somewhat differently. Each company has a folder named by its permalink. Inside each folder, there are one or more files, each representing a TechCrunch article about the corresponding company and named by the MD5 hash of the article URL.
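The layout described above can be navigated with a few lines of Python. The following is a minimal sketch, not the scripts we used. It assumes that the URL is the last whitespace-separated token on each line, that the MD5 hash is the hexadecimal digest of the UTF-8 encoded URL, and that article files carry no extension. The function names and the `articles_root` directory are hypothetical.

```python
import hashlib
import os


def parse_url_line(line):
    """Split one line of a news-URL file into (date_text, url).

    Assumption: the URL contains no whitespace and is the last token on
    the line; everything before it is treated as the date.
    """
    date_text, url = line.strip().rsplit(None, 1)
    return date_text, url


def article_path(articles_root, permalink, url):
    """Build the expected path of the stored article for a company URL.

    Assumption: the file name is the hex MD5 digest of the UTF-8 encoded
    URL, placed inside a folder named by the company's permalink.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return os.path.join(articles_root, permalink, digest)
```

For example, after reading a line from the URL file for accel-partners, `article_path("articles", "accel-partners", url)` gives the path where the matching article would be expected. If the file is not found, the hashing assumption above is the first thing to check.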
The URLs and TechCrunch articles are stored as plain text in our corpus, while the JSON entity profiles from CrunchBase were saved with the dump() method of the pickle module in Python. For the latter, you can load the JSON object from each pickled entity profile file with the load() method of the same module.
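Loading a pickled profile might look like the sketch below. It assumes the files were written by Python 2 (plausible for a 2012 corpus, but not verified here), in which case Python 3 needs `encoding="latin1"` to read the old string objects. The function name `load_profile` is hypothetical.

```python
import pickle


def load_profile(path):
    """Load one pickled CrunchBase entity profile and return the object."""
    with open(path, "rb") as f:
        # Assumption: the file was pickled under Python 2. The latin1
        # encoding lets Python 3 decode Python 2 str objects; it is
        # harmless for pickles written by Python 3.
        return pickle.load(f, encoding="latin1")
```

Since each file name is a permalink, `load_profile("accel-partners")` would return the profile for Accel Partners when run inside the directory of financial organization profiles. If the returned object is a dictionary, `sorted(profile.keys())` lists the available fields.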
In this work, we examined the task of Merger and Acquisition (M&A) prediction, which has been an interesting and challenging research topic for the past few decades. Specifically, we used the profiles and news articles for companies and people on TechCrunch, and explored topic features obtained via topic modeling techniques, as well as a set of other novel features of our own design, within a machine learning framework. We conducted experiments of the largest scale in the literature, and achieved a high true positive rate (TP) between 60% and 79.8% with a false positive rate (FP) mostly between 0% and 8.3% over company categories with a small number of missing attributes in the CrunchBase profiles. Please refer to our paper [short][long] for more details.
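For readers unfamiliar with the two rates reported above, the sketch below computes them from binary labels using the standard definitions (TP rate = TP / (TP + FN), FP rate = FP / (FP + TN)). We assume here that label 1 means "acquired" and 0 means "not acquired". The function name and the toy labels in the usage note are hypothetical, and the paper remains the reference for the exact evaluation protocol.

```python
def tp_fp_rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels.

    Assumption: 1 marks an acquired company and 0 a non-acquired one.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against empty classes to avoid division by zero.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr
```

With the toy input `tp_fp_rates([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1])`, two of the three acquired companies are detected and one of the five non-acquired companies is flagged by mistake.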
The work introduced above is just our first step, and there are many more interesting ideas that could be explored on the CrunchBase data set. If you have some thoughts and would like to share them with us, we would be more than happy to hear from you.