Krzysztof Czuba's IR LAB page

Basic Information


Abstract

This project extends my 11-741 project. The topic is Translingual Information Retrieval using different retrieval methods and models. In the first part of the project, the following methods will be applied to the English-Spanish UNICEF collection: DICT, DICT+WordNet, GVSM, PRF, and LSI. In the second part, an automatically generated corpus will be used for the methods that require training on bilingual corpora. The corpus was obtained from the translation service that AltaVista provides on the Web, which is based on the SYSTRAN MT system.

The original project will be extended with an experiment in which the DICT method is run using an automatically translated dictionary. The dictionary will be obtained by extracting all terms appearing in the UNICEF collection and translating them using SYSTRAN. This experiment will provide a baseline for comparing the results of the runs in which the whole corpus was automatically translated.
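The dictionary-building step described above can be sketched roughly as follows. This is only an illustration: it assumes that DICT translates a query term by term, the function names are hypothetical, and `toy_mt` is a tiny hand-made stand-in for the SYSTRAN translations, not real output of that system.

```python
import re

def tokenize(text):
    """Split text into lowercased alphabetic tokens (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def extract_vocabulary(documents):
    """Collect every distinct term that appears in the collection."""
    vocab = set()
    for text in documents:
        vocab.update(tokenize(text))
    return sorted(vocab)

def build_dictionary(vocabulary, translate_term):
    """Translate each term once; translate_term stands in for the MT system."""
    return {term: translate_term(term) for term in vocabulary}

def translate_query(query, dictionary):
    """Replace each query term by its translations; keep terms with no entry."""
    translated = []
    for term in tokenize(query):
        translated.extend(dictionary.get(term) or [term])
    return translated

# Hypothetical stand-in for the MT system (assumption, not SYSTRAN output).
toy_mt = {"children": ["niños"], "health": ["salud"], "water": ["agua"]}

docs = ["Children need clean water.", "Health of children"]
vocab = extract_vocabulary(docs)
dictionary = build_dictionary(vocab, lambda term: toy_mt.get(term, []))
print(translate_query("water and health", dictionary))
# → ['agua', 'and', 'salud']
```

Terms without a translation are passed through unchanged here; whether the actual runs dropped or kept such terms is not stated on this page.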

In addition, a more sophisticated evaluation method will be used to assess how well each method performs. In particular, many runs will be done on different partitions of the corpus into training, fine-tuning, and test sets. The addition of the fine-tuning set is especially interesting because, in the previous results, the parameters were tuned on the training set.
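A minimal sketch of such an evaluation loop is given below. The split fractions, the number of runs, and the `score(param, train, eval_set)` callable are all assumptions made for illustration; the page does not specify them. The point is only that the parameter is chosen on the fine-tuning set and then reported on the held-out test set.

```python
import random

def partition(items, train_frac=0.6, tune_frac=0.2, seed=0):
    """Randomly split items into training, fine-tuning, and test sets."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_tune = int(len(shuffled) * tune_frac)
    train = shuffled[:n_train]
    tune = shuffled[n_train:n_train + n_tune]
    test = shuffled[n_train + n_tune:]
    return train, tune, test

def tune_and_test(items, candidates, score, runs=5):
    """For each random partition, pick the candidate parameter that scores
    best on the fine-tuning set, then report its score on the test set."""
    results = []
    for seed in range(runs):
        train, tune, test = partition(items, seed=seed)
        best = max(candidates, key=lambda p: score(p, train, tune))
        results.append((best, score(best, train, test)))
    return results
```

A caller would pass a list of documents (or document pairs), the candidate parameter values, and a scoring function that trains on `train` and evaluates on the given set.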

Proposal and Timelines

Task                                                          To be done by   Status
Set up the web page                                           October 20th    done
Extract and translate the vocabulary                          November 1st    done
Evaluate DICT using the automatically translated dictionary   November 5th    done
Code for automatic evaluation                                                 done
Evaluation of PRF                                             November 1st    pending
Evaluation of GVSM and LSI on UNICEF                          November 15th   pending
Write-up                                                      December 10th   pending

Results for GVSM and LSI on the finetuning set

GVSM
Training Set               paragraph: atc ltc ntc    document: atc ltc ntc    sentence: atc ltc ntc
No WordNet expansion       200: 0.3638    130: 0.3860    80: 0.4229    140: 0.3443    150: 0.3843    80: 0.4231
Single WordNet expansion   170: 0.3551    140: 0.3751    250: 0.3832    30: 0.3327    40: 0.3728    100: 0.3942
Full WordNet expansion     290: 0.3498    330: 0.3796    400: 0.3827    120: 0.3733    110: 0.3900

LSI
Training Set               paragraph: atc ltc ntc    document: atc ltc ntc    sentence: atc ltc ntc
No WordNet expansion       270: 0.3316    230: 0.3653    310: 0.4165    200: 0.3829    250: 0.4222    400: 0.4428
Single WordNet expansion   390: 0.3238    340: 0.3665    340: 0.3988    390: 0.3682    290: 0.4192    100: 0.4214
Full WordNet expansion     270: 0.3090    360: 0.3467    380: 0.3712    350: 0.4251    100: 0.4285

Results on the test set

Columns: no WordNet expansion, single WordNet expansion, full WordNet expansion
GVSM   0.4191
LSI    0.4450

last update: Dec 7th, 1998