The Ancestry.com Forum Dataset

The Ancestry.com Forum Dataset is no longer available for download. The supplemental material and evaluation tool are still available below.

The Ancestry.com Forum Dataset was created with the cooperation of Ancestry.com in an effort to promote research on information retrieval, language technologies, and social network analysis. It contains a full snapshot of the Ancestry.com online forum, boards.ancestry.com, from July 2010. This message board is large, with over 22 million messages, over 3.5 million authors, and active participation for over ten years.

In addition to the document collection, the distribution includes queries from Ancestry.com's query log and pairwise preference relevance judgments for a message thread retrieval task on this forum.

This webpage describes the dataset, gives instructions for obtaining the dataset, and describes the supplemental data to use for thread search information retrieval experiments. Further details of the dataset can be found in the tech report describing the collection.

Contact: Jonathan Elsas.

Document Collection

The Ancestry.com Online Forum document collection is a full snapshot of the online forum, boards.ancestry.com, from July 2010.

Number of Messages	22,054,728
Number of Threads	9,040,958
Number of Sub-forums	165,358
Number of Unique Authors	3,775,670
Message Date Range	December 1995 - July 2010
Size	5 GB (compressed)

The documents distributed in the collection are in the TRECTEXT SGML format, similar to other collections used at the Text REtrieval Conference. An example document is shown below:

<DOC>
<DOCNO>surnames.darcy.000001.000376798</DOCNO>
<PID>000376798</PID>
<SUBFORUM>surnames.darcy</SUBFORUM>
<DATE_STR>03 May 1999</DATE_STR>
<DATE_NUM>199905031200</DATE_NUM>
<THREAD_ID>000001</THREAD_ID>
<POST_ID>000001</POST_ID>
<POST_URL>http://boards.ancestry.com/surnames.darcy/1/mb.ashx</POST_URL>
<AUTHOR_NAME>Heather XXXXXXXX</AUTHOR_NAME>
<AUTHOR>53014</AUTHOR>
<POST_TITLE>Thomas Darcy</POST_TITLE>
<TEXT>I am interested in any information available on a Thomas Darcy, born in NYC, NY to Michael & Catherine SMITH Darcy. He is the eldest child and appears in the census with the family as well as residing with his mother at the time of her death in the early 1910's. He is listed in the 1913 NY Directory as a clerk. Please contact me at XXXXX@XXXX.XXX if any of this information sounds familar.</TEXT>
</DOC>

All the messages have the following fields:

DOCNO	Unique message identifier, containing thread membership information.
PID	Unique numeric message identifier.
SUBFORUM	Subforum containing the post.
DATE_STR	Publication date of the post. "01-01-1900" if missing.
DATE_NUM	Numeric representation of the publication date.
THREAD_ID	Thread identifier. Unique per subforum.
POST_ID	Post identifier. Unique per subforum.
POST_URL	URL of the post on the Ancestry.com Forum website.
AUTHOR_NAME	Author name.
AUTHOR	Unique numeric author identifier. "0" if missing.
POST_TITLE	Title of the post.
TEXT	Text content of the post.
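
As a rough illustration of reading this format, the sketch below pulls the fields listed above out of TRECTEXT documents with regular expressions. A regex is used rather than an XML parser because field content can contain unescaped characters such as "&", as in the example document. The function name parse_trectext and the abbreviated sample are assumptions made for this illustration; they are not part of the dataset or its tools.

```python
import re
from typing import Dict, Iterator

# Field names taken from the field table above.
FIELDS = [
    "DOCNO", "PID", "SUBFORUM", "DATE_STR", "DATE_NUM", "THREAD_ID",
    "POST_ID", "POST_URL", "AUTHOR_NAME", "AUTHOR", "POST_TITLE", "TEXT",
]

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
FIELD_RE = re.compile(r"<(%s)>(.*?)</\1>" % "|".join(FIELDS), re.DOTALL)


def parse_trectext(data: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per <DOC> block, mapping field name to stripped text."""
    for doc in DOC_RE.finditer(data):
        yield {name: value.strip() for name, value in FIELD_RE.findall(doc.group(1))}


# Abbreviated copy of the example document (some fields omitted, TEXT shortened).
sample = """<DOC>
<DOCNO>surnames.darcy.000001.000376798</DOCNO>
<PID>000376798</PID>
<SUBFORUM>surnames.darcy</SUBFORUM>
<THREAD_ID>000001</THREAD_ID>
<POST_ID>000001</POST_ID>
<POST_TITLE>Thomas Darcy</POST_TITLE>
<TEXT>I am interested in any information available on a Thomas Darcy, born in NYC, NY to Michael & Catherine SMITH Darcy.</TEXT>
</DOC>
"""

docs = list(parse_trectext(sample))
print(docs[0]["DOCNO"], docs[0]["POST_TITLE"])
```

This version reads the whole input into one string, which is convenient for small samples; a file of the collection's size would more likely be processed one <DOC> block at a time.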

The message threading structure can be identified from the content of the SUBFORUM, THREAD_ID and POST_ID fields:

Messages belonging to the same thread have identical SUBFORUM and THREAD_ID values.
The first message of a thread has matching THREAD_ID and POST_ID values, as in the example above.
A message (A) is a direct response to another message (B) in the thread if message A's POST_ID contains message B's POST_ID followed by four characters.
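
The three rules above can be sketched in code as follows. This is a minimal illustration, not the dataset's own tooling: the function names are hypothetical, and the reply POST_IDs in the sample are invented placeholders, since the source shows only a first-message POST_ID and does not show what the four appended characters look like.

```python
from typing import Dict, Iterable, List, Optional, Tuple

SUFFIX_LEN = 4  # a reply's POST_ID is its parent's POST_ID plus four characters

ThreadKey = Tuple[str, str]  # (SUBFORUM, THREAD_ID)


def parent_post_id(post_id: str, thread_id: str) -> Optional[str]:
    """Return the POST_ID being replied to, or None for a thread's first message."""
    if post_id == thread_id:
        return None
    return post_id[:-SUFFIX_LEN]


def build_threads(
    messages: Iterable[Dict[str, str]],
) -> Dict[ThreadKey, Dict[Optional[str], List[str]]]:
    """Group messages by thread and map each parent POST_ID to its direct replies."""
    threads: Dict[ThreadKey, Dict[Optional[str], List[str]]] = {}
    for m in messages:
        key = (m["SUBFORUM"], m["THREAD_ID"])
        parent = parent_post_id(m["POST_ID"], m["THREAD_ID"])
        threads.setdefault(key, {}).setdefault(parent, []).append(m["POST_ID"])
    return threads


# Hypothetical thread: only the first POST_ID ("000001") comes from the example document.
messages = [
    {"SUBFORUM": "surnames.darcy", "THREAD_ID": "000001", "POST_ID": "000001"},
    {"SUBFORUM": "surnames.darcy", "THREAD_ID": "000001", "POST_ID": "000001.001"},
    {"SUBFORUM": "surnames.darcy", "THREAD_ID": "000001", "POST_ID": "000001.002"},
    {"SUBFORUM": "surnames.darcy", "THREAD_ID": "000001", "POST_ID": "000001.001.001"},
]

tree = build_threads(messages)
```

In the resulting mapping, the key None holds the first message of each thread, and every other key is a POST_ID whose value lists its direct replies.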

An example threading structure is shown below, with the POST_IDs in a single thread shown in the nodes, and edges representing a message response relationship.

Obtaining a Copy of the Dataset

The Ancestry.com Forum Dataset is no longer available for download. The supplemental material and evaluation tool are still available below.

Thread Retrieval Task

In addition to the document collection, we distribute a query set and relevance judgments suitable for use as an information retrieval test collection for studying message thread retrieval.

The queries distributed with this dataset were sampled from the Ancestry.com query log. The query set reflects the primary type of information need expressed by the users of Ancestry.com. All of the queries distributed in this query set contain at least a person's name, and half of these queries contain additional information such as a location or other keyword.

Pairwise preference assessments were collected using a simulated pool of retrieval runs. The preference assessment collection followed the guidelines in Carterette et al.'s SIGIR 2008 paper "A Test Collection of Preference Judgments". The pooling and assessment process is described in detail in the tech report distributed with this dataset.

Along with the queries and relevance judgments, you can also download the evaluation tool to compute pairwise-preference performance measures. This tool is released as open source.

IMPORTANT NOTE: The pairwise preference data uses the DOCNO of the FIRST message of a thread as the document identifier. See the document collection section for an explanation of how the POST_ID and THREAD_ID fields are used to identify the first message of a thread.
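
To make the note concrete, here is a small sketch of building a lookup from each thread to the DOCNO of its first message, using the rule that a first message has matching THREAD_ID and POST_ID values. The function name is hypothetical, and the second sample message (its POST_ID, PID, and DOCNO) is invented for illustration; only the first one comes from the example document.

```python
from typing import Dict, Iterable, Tuple


def first_message_docnos(
    messages: Iterable[Dict[str, str]],
) -> Dict[Tuple[str, str], str]:
    """Map each (SUBFORUM, THREAD_ID) to the DOCNO of the thread's first message."""
    firsts: Dict[Tuple[str, str], str] = {}
    for m in messages:
        # The first message of a thread has matching THREAD_ID and POST_ID values.
        if m["POST_ID"] == m["THREAD_ID"]:
            firsts[(m["SUBFORUM"], m["THREAD_ID"])] = m["DOCNO"]
    return firsts


messages = [
    # From the example document in the document collection section.
    {"DOCNO": "surnames.darcy.000001.000376798", "SUBFORUM": "surnames.darcy",
     "THREAD_ID": "000001", "POST_ID": "000001"},
    # Invented reply, included only to show that replies are skipped.
    {"DOCNO": "surnames.darcy.000001.000376799", "SUBFORUM": "surnames.darcy",
     "THREAD_ID": "000001", "POST_ID": "000001.001"},
]

thread_docnos = first_message_docnos(messages)
```

The values of this mapping are the identifiers that would be expected to appear in the pairwise preference data.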

Downloads:

Query Set, one query per line, format [query ID]:[query text]
Pairwise Preferences, one preference per line, format described in evaluation tool README below
Pairwise Preference Evaluation Tool
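
A query file in the [query ID]:[query text] format could be read along these lines. This is a sketch under two assumptions: that query IDs never contain a colon, so splitting on the first colon is safe, and that blank lines can be ignored. The function name and the two sample queries are invented for illustration and are not taken from the actual query set.

```python
from typing import Dict, Iterable


def parse_queries(lines: Iterable[str]) -> Dict[str, str]:
    """Parse lines of the form '[query ID]:[query text]' into an ID-to-text dict."""
    queries: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no query
        # Split on the first colon only, so the query text may itself contain colons.
        query_id, sep, text = line.partition(":")
        if not sep:
            raise ValueError("expected '[query ID]:[query text]', got: %r" % line)
        queries[query_id.strip()] = text.strip()
    return queries


# Invented sample lines in the documented format.
sample_lines = ["1:thomas darcy\n", "2:thomas darcy new york\n", "\n"]
queries = parse_queries(sample_lines)
```

The same function accepts an open file object, since iterating over a text file yields one line at a time.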

Publishing Research Using the Dataset

If you use the Ancestry.com Forum Dataset in published research, you must provide attribution to Ancestry.com as the source of the dataset. Also, please include a reference to the tech report describing the dataset:

Jonathan Elsas, "The Ancestry.com Forum Dataset", CMU LTI Tech Report CMU-LTI-017, 2011.