Krzysztof Czuba's IR LAB page

Basic Information

Project Title: Translingual IR
Name: Krzysztof Czuba (email: kczuba@cs.cmu.edu)

Abstract
Proposal and Timelines
System Description
Experiments
Results

Abstract

This project extends my 11-741 project. The topic is Translingual Information Retrieval using different retrieval methods and models. In the first part of the project, the following methods will be applied to the English-Spanish UNICEF collection: DICT, DICT+WordNet, GVSM, PRF, and LSI. In the second part of the project, an automatically generated corpus will be used for the methods that require training on bilingual corpora. The corpus was obtained from the translation service that AltaVista provides on the Web, which is based on the SYSTRAN MT system.

The original project will be extended with an experiment in which the DICT method is run using an automatically translated dictionary. The dictionary will be obtained by extracting all terms appearing in the UNICEF collection and translating them using SYSTRAN. This experiment will provide a baseline for comparing the results of the runs in which the whole corpus was automatically translated.
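The vocabulary extraction and dictionary lookup described above can be sketched as follows. This is a minimal illustration, not the code used in the project: the tokenizer, the function names `extract_vocabulary` and `translate_query`, and the toy dictionary are all assumptions, and the hand-written dictionary only stands in for output that would come from SYSTRAN.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of Unicode letters, lowercased.
TOKEN = re.compile(r"[^\W\d_]+")

def extract_vocabulary(documents):
    """Count every distinct term that appears in the collection."""
    counts = Counter()
    for text in documents:
        counts.update(TOKEN.findall(text.lower()))
    return counts

def translate_query(query, dictionary):
    """Replace each query term with its dictionary translations.

    Terms missing from the dictionary are kept unchanged.
    """
    translated = []
    for term in TOKEN.findall(query.lower()):
        translated.extend(dictionary.get(term, [term]))
    return translated

# Toy data; the dictionary is a hand-written stand-in for MT output.
docs = ["Children need clean water.", "Clean water saves children."]
vocabulary = extract_vocabulary(docs)
toy_dictionary = {"children": ["niños"], "water": ["agua"], "clean": ["limpia"]}
spanish_query = translate_query("clean water for children", toy_dictionary)
```

In this sketch the keys of `vocabulary` are the terms that would be sent to the translation system, and the translated query would then be run against the Spanish side of the collection.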

In addition, a more thorough evaluation procedure will be used to assess how well each method performs. In particular, many runs will be done on different partitions of the corpus into training, fine-tuning, and test sets. The addition of the fine-tuning set is especially interesting because in the previous results the parameters were tuned on the training set.
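The evaluation over repeated partitions can be sketched as below. This is only an illustration under stated assumptions: the 60/20/20 split, the number of runs, and the `fit` and `score` callables are hypothetical placeholders, not the project's actual settings or evaluation code.

```python
import random

def partition(doc_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the ids and split them into training, fine-tuning, and test sets.

    The split fractions are an assumed example, not taken from the project.
    """
    rng = random.Random(seed)
    ids = list(doc_ids)
    rng.shuffle(ids)
    n_train = round(len(ids) * fractions[0])
    n_tune = round(len(ids) * fractions[1])
    train = ids[:n_train]
    tune = ids[n_train:n_train + n_tune]
    test = ids[n_train + n_tune:]
    return train, tune, test

def select_and_evaluate(doc_ids, candidates, fit, score, runs=5):
    """For each partition, pick the parameter on the fine-tuning set only,
    then report its score on the held-out test set."""
    results = []
    for run in range(runs):
        train, tune, test = partition(doc_ids, seed=run)
        best = max(candidates, key=lambda p: score(fit(train, p), tune))
        results.append((best, score(fit(train, best), test)))
    return results
```

The point of the design is that the test set is never consulted when choosing a parameter value, so the reported score is not biased by tuning.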

Proposal and Timelines

Task	To be done by	Status
Set up the web page	October 20th	done
Extract and translate the vocabulary	November 1st	done
Evaluate DICT using the automatically translated dictionary	November 5th	done
Code for automatic evaluation		done
Evaluation of PRF	November 1st	pending
Evaluation of GVSM and LSI on UNICEF	November 15th	pending
Write-up	December 10th	pending

Results for GVSM and LSI on the fine-tuning set

GVSM	paragraph			document			sentence
Training Set	atc	ltc	ntc	atc	ltc	ntc	atc	ltc	ntc
No WordNet expansion	200: 0.3638	130: 0.3860	80: 0.4229	140: 0.3443	150: 0.3843	80: 0.4231
Single WordNet expansion	170: 0.3551	140: 0.3751	250: 0.3832	30: 0.3327	40: 0.3728	100: 0.3942
Full WordNet expansion	290: 0.3498	330: 0.3796	400: 0.3827	120: 0.3733		110: 0.3900

LSI	paragraph			document			sentence
Training Set	atc	ltc	ntc	atc	ltc	ntc	atc	ltc	ntc
No WordNet expansion	270: 0.3316	230: 0.3653	310: 0.4165	200: 0.3829	250: 0.4222	400: 0.4428
Single WordNet expansion	390: 0.3238	340: 0.3665	340: 0.3988	390: 0.3682	290: 0.4192	100: 0.4214
Full WordNet expansion	270: 0.3090	360: 0.3467	380: 0.3712	350: 0.4251		100: 0.4285

Results on the test set

	no WordNet expansion	single WordNet expansion	full WordNet expansion
GVSM	0.4191
LSI	0.4450