The Ancestry.com Forum Dataset
The Ancestry.com Forum Dataset is no longer available for download. The supplemental material and evaluation tool are still available below.
The Ancestry.com Forum Dataset was created with the cooperation of Ancestry.com in an effort to promote research on information retrieval, language technologies, and social network analysis. It contains a full snapshot of the Ancestry.com online forum, boards.ancestry.com, from July 2010. This message board is large, with over 22 million messages, over 3.5 million authors, and active participation for over ten years.
In addition to the document collection, queries from Ancestry.com's query log and pairwise preference relevance judgements for a thread retrieval task are distributed with the dataset.
This webpage describes the dataset, gives instructions for obtaining the dataset, and describes the supplemental data to use for thread search information retrieval experiments. Further details of the dataset can be found in the tech report describing the collection.
Contact: Jonathan Elsas.
Document Collection
The Ancestry.com Online Forum document collection is a full snapshot of the online forum, boards.ancestry.com, from July 2010.
Number of Messages | 22,054,728 |
Number of Threads | 9,040,958 |
Number of Sub-forums | 165,358 |
Number of Unique Authors | 3,775,670 |
Message Date Range | December 1995 - July 2010 |
Size | 5 GB (compressed) |
The documents distributed in the collection are in the TRECTEXT SGML format, similar to other collections used at the Text REtrieval Conference. An example document is shown below:
<DOC>
<DOCNO>surnames.darcy.000001.000376798</DOCNO>
<PID>000376798</PID>
<SUBFORUM>surnames.darcy</SUBFORUM>
<DATE_STR>03 May 1999</DATE_STR>
<DATE_NUM>199905031200</DATE_NUM>
<THREAD_ID>000001</THREAD_ID>
<POST_ID>000001</POST_ID>
<POST_URL>http://boards.ancestry.com/surnames.darcy/1/mb.ashx</POST_URL>
<AUTHOR_NAME>Heather XXXXXXXX</AUTHOR_NAME>
<AUTHOR>53014</AUTHOR>
<POST_TITLE>Thomas Darcy</POST_TITLE>
<TEXT>I am interested in any information available on a Thomas Darcy, born in NYC, NY to Michael & Catherine SMITH Darcy. He is the eldest child and appears in the census with the family as well as residing with his mother at the time of her death in the early 1910's. He is listed in the 1913 NY Directory as a clerk. Please contact me at XXXXX@XXXX.XXX if any of this information sounds familar.</TEXT>
</DOC>
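A minimal sketch of reading records in this format is given here in Python. It assumes every record ends with a `</DOC>` tag and that each field sits in a flat `<TAG>value</TAG>` pair, as in the example above. Because the message text can contain raw characters such as `&`, the sketch uses a regular expression rather than a strict XML parser. The names `FIELD_RE` and `parse_documents` are illustrative and are not part of the dataset distribution.

```python
import re

# Matches one <TAG>value</TAG> pair. The backreference \1 requires the
# closing tag to carry the same name as the opening tag.
FIELD_RE = re.compile(r"<([A-Z_]+)>(.*?)</\1>", re.DOTALL)


def parse_documents(raw):
    """Yield one dict (field name -> field text) per record in raw."""
    # Splitting on the closing tag works whether or not an opening
    # <DOC> tag is present at the start of each record.
    for chunk in raw.split("</DOC>"):
        fields = {name: value.strip() for name, value in FIELD_RE.findall(chunk)}
        if fields:
            yield fields


# Abbreviated copy of the example record, used only to exercise the parser.
sample = """<DOC>
<DOCNO>surnames.darcy.000001.000376798</DOCNO>
<SUBFORUM>surnames.darcy</SUBFORUM>
<THREAD_ID>000001</THREAD_ID>
<POST_ID>000001</POST_ID>
<TEXT>born in NYC, NY to Michael & Catherine SMITH Darcy.</TEXT>
</DOC>
"""

docs = list(parse_documents(sample))
print(docs[0]["DOCNO"])
```

The parser is a generator, so a large file can be processed record by record if it is read in pieces that end on a `</DOC>` boundary.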
All the messages have the following fields:
DOCNO | Unique message identifier, containing thread membership information. |
PID | Unique numeric message identifier. |
SUBFORUM | Subforum containing the post. |
DATE_STR | Publication date of the post. "01-01-1900" if missing. |
DATE_NUM | Numeric representation of the publication date. |
THREAD_ID | Thread identifier. Unique per subforum. |
POST_ID | Post identifier. Unique per subforum. |
POST_URL | URL of the post on the Ancestry.com Forum website. |
AUTHOR_NAME | Author name. |
AUTHOR | Unique numeric author identifier. "0" if missing. |
POST_TITLE | Title of the post. |
TEXT | Text content of the post. |
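The missing-value markers and the numeric date can be handled with a few helpers, sketched below. The `YYYYMMDDhhmm` layout for DATE_NUM is an assumption inferred from the single example value `199905031200`; it is not stated in the field descriptions. The helper names are illustrative.

```python
from datetime import datetime

MISSING_DATE_STR = "01-01-1900"  # documented marker for a missing DATE_STR
MISSING_AUTHOR = "0"             # documented marker for a missing AUTHOR


def parse_date_num(date_num):
    """Parse DATE_NUM, ASSUMING a YYYYMMDDhhmm layout (inferred, not documented)."""
    return datetime.strptime(date_num, "%Y%m%d%H%M")


def has_known_date(fields):
    """True unless DATE_STR is absent or carries the missing-value marker."""
    return fields.get("DATE_STR", MISSING_DATE_STR) != MISSING_DATE_STR


def has_known_author(fields):
    """True unless AUTHOR is absent or carries the missing-value marker."""
    return fields.get("AUTHOR", MISSING_AUTHOR) != MISSING_AUTHOR


stamp = parse_date_num("199905031200")
print(stamp.isoformat())  # 1999-05-03T12:00:00
```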
The message threading structure can be identified from the content of the SUBFORUM, THREAD_ID and POST_ID fields:
- Messages belonging to the same thread have identical SUBFORUM and THREAD_ID values.
- The first message of a thread has matching THREAD_ID and POST_ID values, as in the example above.
- A message (A) is a direct response to another message (B) in the thread if message A's POST_ID contains message B's POST_ID followed by four characters.
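The three rules above can be sketched in Python as follows. The function names are illustrative. The reply POST_IDs in the usage example are invented to show the length rule and do not reflect the actual characters used in the collection.

```python
REPLY_SUFFIX_LEN = 4  # a reply's POST_ID is its parent's POST_ID plus four characters


def is_thread_root(thread_id, post_id):
    """The first message of a thread has matching THREAD_ID and POST_ID."""
    return thread_id == post_id


def is_direct_reply(post_id_a, post_id_b):
    """True if message A directly answers message B (same thread assumed)."""
    return (post_id_a.startswith(post_id_b)
            and len(post_id_a) == len(post_id_b) + REPLY_SUFFIX_LEN)


def parent_post_id(thread_id, post_id):
    """Return the parent's POST_ID, or None for the first message of a thread."""
    if is_thread_root(thread_id, post_id):
        return None
    return post_id[:-REPLY_SUFFIX_LEN]


# Hypothetical POST_IDs within thread "000001"; the four-character
# suffixes are made up for illustration.
posts = ["000001", "000001aaaa", "000001bbbb", "000001aaaacccc"]
tree = {post: parent_post_id("000001", post) for post in posts}
for post, parent in tree.items():
    print(post, "->", parent)
```

Grouping messages by their SUBFORUM and THREAD_ID values before applying `parent_post_id` keeps the prefix comparison within a single thread.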
An example threading structure is shown below, with the POST_IDs in a single thread shown in the nodes, and edges representing a message response relationship.
Obtaining a Copy of the Dataset
The Ancestry.com Forum Dataset is no longer available for download. The supplemental material and evaluation tool are still available below.
Thread Retrieval Task
In addition to the document collection, we distribute a query set and relevance judgements appropriate to use as an information retrieval test collection for studying message thread retrieval.
The queries distributed with this dataset were sampled from the Ancestry.com query log. The query set reflects the primary type of information need expressed by the users of Ancestry.com. All of the queries distributed in this query set contain at least a person's name, and half of these queries contain additional information such as a location or other keyword.
Pairwise preference assessments were collected using a simulated pool of retrieval runs. The preference assessment collection followed the guidelines in Carterette et al.'s SIGIR 2008 paper "A Test Collection of Preference Judgments". The pooling and assessment process is described in detail in the tech report distributed with this dataset.
Along with the queries and relevance judgments, you can also download the evaluation tool to compute pairwise-preference performance measures. This tool is released as open source.
IMPORTANT NOTE: The pairwise preference data uses the DOCNO of the FIRST message of a thread as the document identifier. See the document collection section for an explanation of how the POST_ID and THREAD_ID fields are used to identify the first message of a thread.
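A retrieval system that ranks individual messages therefore needs to map each message to the DOCNO of its thread's first message before scoring against the preference data. The sketch below shows one way to do this. It assumes messages are already available as dictionaries keyed by field name, and the reply's DOCNO, PID and POST_ID in the usage example are hypothetical values.

```python
def first_message_docnos(messages):
    """Map (SUBFORUM, THREAD_ID) to the DOCNO of the thread's first message."""
    roots = {}
    for msg in messages:
        # The first message of a thread has matching THREAD_ID and POST_ID.
        if msg["THREAD_ID"] == msg["POST_ID"]:
            roots[(msg["SUBFORUM"], msg["THREAD_ID"])] = msg["DOCNO"]
    return roots


def thread_docno(msg, roots):
    """Return the thread-level identifier for a message, or None if unknown."""
    return roots.get((msg["SUBFORUM"], msg["THREAD_ID"]))


messages = [
    # First message, taken from the example document.
    {"DOCNO": "surnames.darcy.000001.000376798", "SUBFORUM": "surnames.darcy",
     "THREAD_ID": "000001", "POST_ID": "000001"},
    # Hypothetical reply; its DOCNO and POST_ID are invented for illustration.
    {"DOCNO": "surnames.darcy.000001.000999999", "SUBFORUM": "surnames.darcy",
     "THREAD_ID": "000001", "POST_ID": "000001aaaa"},
]

roots = first_message_docnos(messages)
print(thread_docno(messages[1], roots))
```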
Downloads:
- Query Set, one query per line, format [query ID]:[query text]
- Pairwise Preferences, one preference per line, format described in evaluation tool README below
- Pairwise Preference Evaluation Tool
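The query set format can be read with a short helper, sketched below. The sample lines are hypothetical and only illustrate the `[query ID]:[query text]` layout. Splitting on the first colon only is a defensive choice in case a query text itself contains a colon.

```python
def parse_query_line(line):
    """Split one '[query ID]:[query text]' line into its two parts."""
    # maxsplit=1 keeps any later colons inside the query text.
    query_id, query_text = line.rstrip("\n").split(":", 1)
    return query_id.strip(), query_text.strip()


# Hypothetical query lines; the real IDs and texts come from the Query Set file.
sample_lines = ["101:thomas darcy\n", "102:mary smith new york\n", "\n"]

queries = dict(parse_query_line(line) for line in sample_lines if line.strip())
print(queries["101"])
```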
Publishing Research Using the Dataset
If you use the Ancestry.com Forum Dataset in published research, you must provide attribution to Ancestry.com as the source of the dataset. Also, please include a reference to the tech report describing the dataset: Jonathan Elsas, "The Ancestry.com Forum Dataset", CMU LTI Tech Report CMU-LTI-017, 2011.