Advanced Statistical Language Processing: Reading the Web (10-709)

Data and Software

Tom Mitchell
Machine Learning Department
School of Computer Science, Carnegie Mellon University

Fall 2009

We have several data sets available to support class projects.

1. Co-occurrence statistics between noun phrases (e.g., 'New York City') and contexts (e.g., 'mayor of __'). Two sets of this data are available.

NPContext1. Data describing approximately 88,000 of the most frequent noun phrases (NPs) in English, approximately 99,000 of the most frequent contexts (e.g., 'mayor of __'), and the frequency with which each co-occurs with the other. This data was collected from 200 million web pages, considering only NPs and contexts that occur at least 500 times each. It is available as an NP x Context array whose element (i, j) gives the frequency of co-occurrence of NP i with Context j. The array is sparse, and fits into Matlab on a reasonably sized laptop. Here is the data (78 Mbytes) and a description of its format.

NPContext2. Similar data, but much larger (850 Mbytes) and somewhat cleaner in its choice of what constitutes an NP and a context. To get started, and if you want data that loads easily into Matlab as a sparse matrix, you probably want NPContext1 instead. NPContext2 is available upon request. Here is a brief description of it.
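As a rough illustration of what a sparse NP x Context array looks like to a program, here is a minimal Python sketch. It does not read the actual NPContext1 files (see the format description for those); the triples, counts, and function names below are invented for illustration only. Only nonzero entries are stored, which is what lets a matrix of this shape fit in memory.

```python
from collections import defaultdict

# Toy (NP, context, count) triples. These values are made up for
# illustration and are NOT taken from NPContext1 or NPContext2.
TOY_TRIPLES = [
    ("New York City", "mayor of __", 12),
    ("New York City", "flights to __", 30),
    ("Pittsburgh", "mayor of __", 5),
    ("Pittsburgh", "flights to __", 7),
    ("Pittsburgh", "bridges in __", 3),
]


def build_sparse_matrix(triples):
    """Return (rows, np_index, context_index).

    rows[i][j] holds the co-occurrence count of NP i with context j.
    Pairs that never co-occur are simply absent (implicitly zero).
    """
    np_index, context_index = {}, {}
    rows = defaultdict(dict)
    for noun_phrase, context, count in triples:
        i = np_index.setdefault(noun_phrase, len(np_index))
        j = context_index.setdefault(context, len(context_index))
        rows[i][j] = rows[i].get(j, 0) + count
    return rows, np_index, context_index


def cooccurrence(rows, np_index, context_index, noun_phrase, context):
    """Look up one entry of the array; unseen pairs count as zero."""
    i = np_index.get(noun_phrase)
    j = context_index.get(context)
    if i is None or j is None:
        return 0
    return rows.get(i, {}).get(j, 0)


def top_contexts(rows, np_index, context_index, noun_phrase, k=2):
    """Return the k contexts that co-occur most often with an NP."""
    i = np_index.get(noun_phrase)
    if i is None:
        return []
    names = {j: context for context, j in context_index.items()}
    ranked = sorted(rows.get(i, {}).items(),
                    key=lambda item: item[1], reverse=True)
    return [(names[j], count) for j, count in ranked[:k]]
```

For example, after `rows, np_index, context_index = build_sparse_matrix(TOY_TRIPLES)`, calling `top_contexts(rows, np_index, context_index, "Pittsburgh")` ranks the contexts for that toy NP by count. With the real data you would more likely load the array as a Matlab sparse matrix, as noted above.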

2. Browsable knowledge bases, including lists of instances of categories (animals, shapes, people, etc.), lists of instances of relations such as plays_sport(person,team), learned extraction patterns, and more. This data is available here, and software to access it programmatically is listed just below. Some items from this site that might be especially useful are:

a browsable knowledge base
lists of extracted instances of categories and relations
list of learned extraction patterns (e.g., '__ is a physical science') for various categories and relations

Matlab software to access the KB through your program. See the software documentation.
Java software paralleling the Matlab version. See the Matlab documentation above, and then the comments in the included RTWJavaDemo.java.

3. (Coming soon.) Co-occurrence statistics between individual English words. In particular, a 50k by 50k array giving the frequency with which each of the 50k most frequent words/tokens in English co-occurs with each of the others.
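Until that data is posted, the following Python sketch shows one way word-by-word co-occurrence counts can be computed on a small scale. It is only an illustration: the definition of co-occurrence used here (two words appearing within a fixed window of tokens) is an assumption, since the page does not say how the forthcoming array defines it, and the function names and toy sentence are hypothetical.

```python
from collections import Counter


def most_frequent_words(tokens, k):
    """Return the set of the k most frequent tokens (the vocabulary)."""
    return {word for word, _ in Counter(tokens).most_common(k)}


def cooccurrence_counts(tokens, vocabulary, window=2):
    """Count unordered pairs of vocabulary words that appear within
    `window` tokens of each other.

    ASSUMPTION: a fixed-size window defines co-occurrence here; the
    actual data set may use a different definition.
    """
    counts = Counter()
    for pos, word in enumerate(tokens):
        if word not in vocabulary:
            continue
        # Look only ahead, so each pair of positions is counted once.
        for other in tokens[pos + 1 : pos + 1 + window]:
            if other in vocabulary:
                counts[tuple(sorted((word, other)))] += 1
    return counts


# Toy input, invented for illustration.
tokens = "the cat sat on the mat the cat ran".split()
vocabulary = most_frequent_words(tokens, k=2)
counts = cooccurrence_counts(tokens, vocabulary, window=2)
```

Keys are sorted word pairs, so the counts are symmetric by construction, and a `Counter` stores only pairs that actually occur. The same idea applies at the 50k by 50k scale, where most entries would be zero.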