Tom Mitchell
Machine Learning Department
School of Computer Science, Carnegie Mellon University
Fall 2009
We have several data sets available to support class projects.
1. Co-occurrence statistics between noun phrases (e.g., 'New York City') and contexts (e.g., 'mayor of __'). Two versions of this data are available.
NPContext1. Data
describing approximately 88,000 of the most frequent noun phrases
(NPs) in English, approximately 99,000 of the most frequent contexts
(e.g., 'mayor of __'), and the frequency with which each NP co-occurs
with each context. This data was collected from 200 million web
pages, considering only NPs and contexts that occur at least 500 times
each. It is available as an NP x Context array whose element (i, j)
gives the frequency of co-occurrence of NP i with Context j.
The array is sparse, and fits into Matlab on a reasonably sized laptop.
Here is the data (78 Mbytes) and a description of its format.
NPContext2. Similar
data, but much larger (850 Mbytes) and somewhat cleaner in its choice of what constitutes
an NP and a context. If you are just getting started, or want data
that loads easily into Matlab as a sparse matrix, then
you probably want NPContext1 instead. NPContext2 is available upon request. Here is a brief description of it.
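To make the sparse NP x Context layout concrete, here is a minimal Python sketch of one way to hold such counts in memory and look them up. It is not the course's loader: the tab-separated "noun phrase, context, count" input format, the function names, and the sample counts are all assumptions made up for illustration. The real file format is the one given in the linked description.

```python
from collections import defaultdict

# Hypothetical input: one "noun phrase<TAB>context<TAB>count" triple per line.
# The actual NPContext1 format is documented separately and may differ.
# The counts below are toy values invented for this example.
SAMPLE_LINES = [
    "New York City\tmayor of __\t1200",
    "New York City\tflights to __\t950",
    "Pittsburgh\tmayor of __\t640",
    "Pittsburgh\t__ Steelers\t2100",
]

def load_sparse_counts(lines):
    """Build a row-indexed sparse matrix: counts[np][context] = frequency."""
    counts = defaultdict(dict)
    for line in lines:
        np_text, context, freq = line.rstrip("\n").split("\t")
        counts[np_text][context] = int(freq)
    return counts

def cooccurrence(counts, np_text, context):
    """Return the stored frequency, or 0 for a pair that was never stored."""
    return counts.get(np_text, {}).get(context, 0)

def contexts_shared(counts, np_a, np_b):
    """Contexts in which both noun phrases have a stored count."""
    return sorted(set(counts.get(np_a, {})) & set(counts.get(np_b, {})))

counts = load_sparse_counts(SAMPLE_LINES)
print(cooccurrence(counts, "New York City", "mayor of __"))
print(cooccurrence(counts, "Pittsburgh", "flights to __"))
print(contexts_shared(counts, "New York City", "Pittsburgh"))
```

Only nonzero entries are stored, which is why an array of roughly 88,000 by 99,000 cells can still fit in memory on a laptop.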
2. Browsable knowledge bases,
including lists of instances of categories such as animals, shapes, and people,
lists of instances of relations such as plays_sport(person,team),
learned extraction patterns, and more. This data is available here, and software to access it programmatically is just below. Some items from this site that might be especially useful are:
Java software paralleling the Matlab version. See the Matlab documentation above, and then the comments in the included RTWJavaDemo.java.
3. (Coming soon.) Co-occurrence statistics between individual English
words. In particular, a 50k by 50k array giving the frequency of
co-occurrence of the 50k most frequent words/tokens in English with
one another.
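For a sense of what a word-by-word co-occurrence array contains, the following Python sketch counts co-occurrences among the most frequent tokens of a tiny text. How the course data defines co-occurrence is not stated here, so the fixed-size window, the unordered pairing, and the names `cooccurrence_counts`, `vocab_size`, and `window` are all assumptions chosen for illustration.

```python
from collections import Counter

def cooccurrence_counts(tokens, vocab_size, window):
    """Count pairs of frequent tokens that appear within `window` positions.

    Assumption: two tokens co-occur when the second follows the first by at
    most `window` positions. The real data set may use another definition.
    """
    # Keep only the `vocab_size` most frequent tokens.
    vocab = {w for w, _ in Counter(tokens).most_common(vocab_size)}
    pairs = Counter()
    for i, left in enumerate(tokens):
        if left not in vocab:
            continue
        for right in tokens[i + 1 : i + 1 + window]:
            if right in vocab:
                # Store each unordered pair once, under a sorted key.
                pairs[tuple(sorted((left, right)))] += 1
    return pairs

tokens = "the mayor of the city met the mayor of the town".split()
pairs = cooccurrence_counts(tokens, vocab_size=3, window=2)
for pair, count in sorted(pairs.items()):
    print(pair, count)
```

Storing counts by pair, rather than as a dense 50k by 50k table, keeps memory proportional to the number of pairs actually observed.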