IR Krzysztof Czuba's term project

Basic Information

Project Title: Translingual IR
Name: Krzysztof Czuba (email: kczuba@cs.cmu.edu)
Presentation Date: April 24th

Abstract
Proposal and Timelines
System Description
Experiments
Results

Abstract

The topic of the project is Translingual Information Retrieval (TLIR) using different retrieval methods and models. In the first part of the project, the following methods will be applied to the English-Spanish UNICEF collection: DICT, DICT+WordNet, GVSM, PRF, LSI. In the second part of the project, an automatically generated corpus will be used for the methods that require training on bilingual corpora. The corpus will be obtained from the Web translation service provided by AltaVista, which is based on the SYSTRAN MT system.

Proposal and Timelines

Task	to be done by	status
Set up the web page	March 3rd	done
Investigate the Web translation engine	March 6th	done
Code for obtaining translations from the Web	March 10th	done
Translation of the UNICEF corpus	March 20th	done
Investigate and run DICT on UNICEF	March 17th	done
Investigate and run GVSM on UNICEF	March 17th	done
Investigate and run LSI on UNICEF	March 22nd	done
Investigate and run PRF on UNICEF	March 25th	done
Corpus clean-up	March 30th	done
Implement query expansion using WordNet	April 5th	done
Rerun the experiments with PRF, LSI, GVSM with the generated corpus	April 15th	done
Write-up, presentation	April 24th	done

Results

Queries
WN-1: Queries expanded using the first WordNet sense
WN-ALL: Queries expanded using all WordNet senses
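
The difference between the two query sets can be illustrated with a minimal sketch. It does not use the real WordNet database: `TOY_SENSES` is a hypothetical stand-in in which each word maps to a list of senses and each sense to a list of synonyms, and `expand_query` is an illustrative helper, not the code used in the project.

```python
# Toy stand-in for WordNet (hypothetical data, not real WordNet output):
# each word maps to a list of senses, each sense is a list of synonyms.
TOY_SENSES = {
    "child": [["kid", "youngster"], ["offspring"]],
    "health": [["wellness"], ["condition"]],
}

def expand_query(terms, senses, all_senses):
    """Return the query terms followed by synonyms from the first sense only
    (WN-1 style) or from every sense (WN-ALL style), without duplicates."""
    expanded = list(terms)
    for term in terms:
        term_senses = senses.get(term, [])
        chosen = term_senses if all_senses else term_senses[:1]
        for sense in chosen:
            for synonym in sense:
                if synonym not in expanded:
                    expanded.append(synonym)
    return expanded

query = ["child", "health", "unicef"]
wn_1 = expand_query(query, TOY_SENSES, all_senses=False)
wn_all = expand_query(query, TOY_SENSES, all_senses=True)
print(wn_1)
print(wn_all)
```

Words with no entry in the sense table (here "unicef") are kept unchanged, so expansion never removes a query term.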
DICT
- Best configuration: ntc-ntc, no WordNet query expansion

Precision/recall curves and 11pt Avg for DICT
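
The "ntc-ntc" label is assumed here to follow the SMART weighting notation: natural term frequency (n), inverse document frequency (t) and cosine normalization (c), applied to both documents and queries. The sketch below shows that weighting on a toy collection; the function names and the sample documents are illustrative assumptions, not part of the project code.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    return Counter(term for doc in docs for term in set(doc))

def ntc_vector(tokens, df, n_docs):
    """ntc weighting: raw tf times log(N/df), then cosine-normalized.
    Terms unseen in the collection are ignored."""
    tf = Counter(t for t in tokens if t in df)
    raw = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else {}

def dot(u, v):
    """Inner product of two sparse vectors; equals the cosine for unit vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Toy collection (hypothetical data).
docs = [
    ["water", "water", "child"],
    ["child", "health"],
    ["water", "health", "school"],
]
df = document_frequencies(docs)
doc_vectors = [ntc_vector(d, df, len(docs)) for d in docs]
query_vector = ntc_vector(["water", "school"], df, len(docs))
scores = [dot(query_vector, d) for d in doc_vectors]
print(scores)
```

In this toy run the third document ranks first because it contains the rare term "school", which is the effect the idf component is meant to produce.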
PRF
Best configuration with WordNet
- Human: WN-ALL, paragraphs, K = 5, E = 80, ntc-ntc
- MT: WN-ALL, paragraphs, K = 5, E = 90, ntc-ntc
Best configuration without WordNet
- Human: paragraphs, K = 5, E = 80, ntc-ntc
- MT: paragraphs, K = 5, E = 90, ntc-ntc
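
The role of the two PRF parameters can be sketched as follows, under assumptions that are not stated on this page: K is taken to be the number of top-ranked training units used for feedback, E the number of expansion terms kept for the target-language query, and queries are assumed to be English with Spanish documents. For brevity the first-pass ranking uses plain term overlap rather than ntc cosine similarity, and all names and data are hypothetical.

```python
from collections import Counter

def translingual_prf(query, pairs, k, e):
    """pairs: list of (english_tokens, spanish_tokens) aligned units.
    Rank the English sides against the query, take the top k units,
    and return the e most frequent terms from their Spanish sides."""
    q = set(query)
    # First pass: simple term overlap (a stand-in for ntc cosine similarity).
    ranked = sorted(
        range(len(pairs)),
        key=lambda i: sum(1 for t in pairs[i][0] if t in q),
        reverse=True,
    )
    counts = Counter()
    for i in ranked[:k]:
        counts.update(pairs[i][1])
    return [term for term, _ in counts.most_common(e)]

# Toy aligned corpus (hypothetical data).
pairs = [
    (["child", "health", "care"], ["salud", "infantil", "salud"]),
    (["water", "supply"], ["agua", "suministro"]),
    (["child", "education"], ["educacion", "infantil"]),
]
spanish_query = translingual_prf(["child", "health"], pairs, k=2, e=2)
print(spanish_query)
```

The returned terms would then be used as the query against the Spanish documents. Whether the aligned units are whole documents or paragraphs corresponds to the "documents" and "paragraphs" settings listed above.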
GVSM (run by Xin Liu)
Best configuration with WordNet
- Human: paragraphs, SP = 300
- MT: WN-ALL, documents, SP = 100
Best configuration without WordNet
- Human: paragraphs, SP = 100
- MT: paragraphs, SP = 200
GVSM: 11pt average precision as a function of sparsification, documents
GVSM: 11pt average precision as a function of sparsification, paragraphs
GVSM: Best performing configurations
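
As a rough illustration of the GVSM idea, the sketch below maps an English query and a Spanish document into a common "dual space" with one component per aligned training unit and compares them by cosine similarity. The meaning of SP is not defined on this page; the sketch assumes it is the number of largest components kept in each dual-space vector. Raw term counts are used instead of weighted vectors, and all names and data are hypothetical.

```python
import math

def to_dual_space(tokens, training_docs, sp):
    """One component per training document: the number of term matches
    between the tokens and that document. Only the sp largest components
    are kept (assumed meaning of the sparsification parameter)."""
    vec = [sum(doc.count(t) for t in tokens) for doc in training_docs]
    keep = set(sorted(range(len(vec)), key=lambda j: vec[j], reverse=True)[:sp])
    return [v if j in keep else 0 for j, v in enumerate(vec)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy aligned training corpus (hypothetical data); index j pairs up both sides.
EN_TRAIN = [["child", "health"], ["water", "supply"], ["school", "education"]]
ES_TRAIN = [["salud", "infantil"], ["agua", "suministro"], ["escuela", "educacion"]]

def gvsm_score(query_en, doc_es, sp):
    """Similarity of an English query and a Spanish document in the dual space."""
    return cosine(
        to_dual_space(query_en, EN_TRAIN, sp),
        to_dual_space(doc_es, ES_TRAIN, sp),
    )

score_water = gvsm_score(["water"], ["agua", "potable"], sp=2)
score_school = gvsm_score(["water"], ["escuela", "infantil"], sp=2)
print(score_water, score_school)
```

No dictionary is involved: the query and the document are related only through the training units they each resemble, which is why the method needs a bilingual corpus.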
LSI (run by Xin Liu)
Best configuration with WordNet
- Human: documents, SV = 200
- MT: WN-ALL, paragraphs, SV = 300
Best configuration without WordNet
- Human: documents, SV = 1000
- MT: paragraphs, SV = 300
LSI: 11pt average precision as a function of the number of singular values, documents
LSI: 11pt average precision as a function of the number of singular values, paragraphs
LSI: Best performing configurations

Best 11pt average precision for all the methods:

	DICT	PRF	GVSM	LSI
no WordNet human	0.2898	0.3757	0.4059	0.4333
no WordNet translations	--	0.3671	0.3772	0.3999
with WordNet human	0.2909	0.4027	0.3894	0.4220
with WordNet translations	--	0.3786	0.3822	0.4162
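
For reference, the 11pt average precision used as the evaluation measure can be computed for a single query roughly as follows. This is a minimal sketch of the standard interpolated definition (maximum precision at any recall at or above each of the levels 0.0, 0.1, ..., 1.0), not the evaluation script used for the experiments; the function name and the sample ranking are illustrative.

```python
def eleven_point_average(ranking, relevant):
    """ranking: document ids in ranked order; relevant: set of relevant ids.
    Returns the mean interpolated precision over recall levels 0.0 .. 1.0."""
    points = []  # (recall, precision) observed at each relevant document
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for level in range(11):
        recall_level = level / 10
        # Interpolated precision: best precision at this recall or beyond.
        candidates = [p for r, p in points if r >= recall_level]
        total += max(candidates, default=0.0)
    return total / 11

# Toy example (hypothetical data): relevant documents at ranks 1 and 3.
value = eleven_point_average(["a", "b", "c", "d"], {"a", "c"})
print(round(value, 4))
```

The figures in the table would then be averages of this per-query value over all queries in the test set.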

Final report

Conclusions

The results in the previous section suggest that corpus-based methods for TLIR can be used with automatically translated corpora and still achieve better retrieval quality than the dictionary method. In particular, it is interesting that performance on the machine-translated corpus can reach 97.3% of the performance on the human-translated corpus (PRF with WordNet query expansion).
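
The comparison between the machine-translated and human-translated corpora can be recomputed from the results table with a short script. The values below are copied from the table as printed; the dictionary layout and the helper name are illustrative. Because the ratios are derived from the printed table, they need not coincide exactly with the percentage quoted above.

```python
# 11pt average precision values copied from the results table.
SCORES = {
    ("no WordNet", "human"): {"DICT": 0.2898, "PRF": 0.3757, "GVSM": 0.4059, "LSI": 0.4333},
    ("no WordNet", "MT"): {"PRF": 0.3671, "GVSM": 0.3772, "LSI": 0.3999},
    ("with WordNet", "human"): {"DICT": 0.2909, "PRF": 0.4027, "GVSM": 0.3894, "LSI": 0.4220},
    ("with WordNet", "MT"): {"PRF": 0.3786, "GVSM": 0.3822, "LSI": 0.4162},
}

def mt_to_human_ratio(method, wordnet):
    """Score on the MT corpus divided by the score on the human corpus."""
    return SCORES[(wordnet, "MT")][method] / SCORES[(wordnet, "human")][method]

for wordnet in ("no WordNet", "with WordNet"):
    for method in ("PRF", "GVSM", "LSI"):
        ratio = mt_to_human_ratio(method, wordnet)
        print(f"{method:5s} {wordnet:13s} {100 * ratio:.1f}%")

# Best dictionary result versus the weakest MT-corpus result.
best_dict = max(SCORES[key]["DICT"] for key in SCORES if "DICT" in SCORES[key])
worst_mt = min(v for key in SCORES if key[1] == "MT" for v in SCORES[key].values())
print(best_dict, worst_mt)
```

The last two values make the first claim concrete: even the weakest configuration on the machine-translated corpus scores above the best dictionary run in the table.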

It is also interesting that the WordNet query expansion strategy produced such good results. WN-ALL is the best set of queries. However, since it helped both in the case of the human and the machine-translated corpus, it is not clear how helpful it was in alleviating the paraphrase problem. An analysis of single queries could help answer this question.

Another open question is whether the improvement over the dictionary method is actually due to the use of the MT system. SYSTRAN's lexicon is impressive in terms of coverage, and the better results might simply be a consequence of that. One way to verify this would be to derive a bilingual dictionary from the lexicon used by SYSTRAN and see how well the dictionary method performs with it. However, the results of the tests with GVSM and LSI suggest that there may be benefits to using the MT-translated corpus rather than a dictionary.

As for further work, it would be interesting to see the results on the sentence-aligned corpus. Also, it would be very interesting to see how the MT-based methods would perform on a different collection. MEDLINE is available, but it might not be the right candidate due to the technical terminology it contains.

Running the dictionary method with a dictionary based on the SYSTRAN lexicon would also be a very instructive experiment, but it is unlikely that such a dictionary could be obtained.

last update: April 24th, 1998
