


UNICEF: A small-to-medium translingual database.


This corpus was designed for Translingual Information Retrieval.

The large UN multilingual corpus from the Linguistic Data Consortium contains about 500 MB for each language. Using formatting codes and alignment methods, a subset of the data was extracted and segmented. It consists of 2255 document pairs pertaining to UNICEF reports and deliberations. Of these, 1134 documents were randomly selected for training and 1121 for testing.
Thirty queries have been formulated (by Jaime Carbonell), and humans have produced 33630 relevance judgements, which are used for evaluation. Query length varies from 6 to 36 words, with an average of 14 words per query. The number of relevant documents per query ranges from 0 to 70, with an average of 16.
In the directory you will find everything.
Tom has produced more documentation. I also keep a backup.
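
The counts quoted above can be checked against each other with a short sketch. It splits 2255 pair indices into 1134 training and 1121 test items at random, then counts one judgement per (query, test document) combination. The integer ids, the seed, and the function name split_pairs are hypothetical; the procedure actually used to build the corpus is not described here. That 33630 equals 30 x 1121 is only an observation from the figures, which suggests (but does not confirm) that every query was judged against every test document.

```python
import random

# Figures taken from the description above.
N_PAIRS = 2255    # document pairs in the UNICEF subset
N_TRAIN = 1134    # pairs selected for training
N_TEST = 1121     # pairs held out for testing
N_QUERIES = 30    # formulated queries


def split_pairs(n_pairs, n_train, seed=0):
    """Randomly split pair indices 0..n_pairs-1 into train and test lists.

    The integer ids and the seed are assumptions made for illustration;
    they do not reproduce the original selection.
    """
    ids = list(range(n_pairs))
    rng = random.Random(seed)   # local generator, so the split is repeatable
    rng.shuffle(ids)
    return sorted(ids[:n_train]), sorted(ids[n_train:])


train_ids, test_ids = split_pairs(N_PAIRS, N_TRAIN)

# One judgement per (query, test document) combination, under the
# assumption that every query was judged against every test document.
n_judgements = N_QUERIES * len(test_ids)

print(len(train_ids), len(test_ids), n_judgements)
```

With the figures above, the sketch yields 1134 training pairs, 1121 test pairs, and 33630 query-document combinations, matching the number of relevance judgements reported.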

Bibliographical References

Yang, Carbonell, Brown, Frederking. Translingual Information Retrieval. In AIJ.