IR Krzysztof Czuba's term project

Basic Information

Abstract

The topic of the project is Translingual Information Retrieval (TLIR) using different retrieval methods and models. In the first part of the project, the following methods will be applied to the English-Spanish UNICEF collection: DICT, DICT+WordNet, GVSM, PRF, and LSI. In the second part of the project, an automatically generated corpus will be used for the methods that require training on bilingual corpora. The corpus will be obtained from the Web translation service provided by AltaVista, which is based on the SYSTRAN MT system.

Proposal and Timelines

Task                                                              To be done by  Status
Set up the web page                                               March 3rd      done
Investigate the Web translation engine                            March 6th      done
Code for obtaining translations from the Web                      March 10th     done
Translation of the UNICEF corpus                                  March 20th     done
Investigate and run DICT on UNICEF                                March 17th     done
Investigate and run GVSM on UNICEF                                March 17th     done
Investigate and run LSI on UNICEF                                 March 22nd     done
Investigate and run PRF on UNICEF                                 March 25th     done
Corpus clean-up                                                   March 30th     done
Implement query expansion using WordNet                           April 5th      done
Rerun the PRF, LSI, and GVSM experiments on the generated corpus  April 15th     done
Write-up and presentation                                         April 24th     done

Results

Final report

Conclusions

The results from the previous section suggest that corpus-based methods for TLIR can be used on automatically translated corpora to achieve better retrieval quality than the dictionary method. In particular, it is interesting that performance on the machine-translated corpus can reach 97.3% of the performance on the human-translated corpus (PRF with WordNet query expansion).
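The relative figure quoted above is a ratio: the score of a method on the machine-translated corpus divided by the score of the same method on the human-translated corpus. A minimal sketch of that arithmetic follows, assuming a single scalar effectiveness measure per run. The function name and the sample values are illustrative assumptions, not results from this project.

```python
def relative_performance(mt_score: float, human_score: float) -> float:
    """Return mt_score expressed as a percentage of human_score."""
    if human_score <= 0:
        raise ValueError("human_score must be positive")
    return 100.0 * mt_score / human_score


# Illustrative values only; these are not measurements from the experiments.
example = relative_performance(0.45, 0.50)
print(f"{example:.1f}% of the human-translated baseline")
```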

It is also interesting that the WordNet query expansion strategy produced such good results. WN-ALL is the best set of queries. However, since it helped with both the human-translated and the machine-translated corpus, it is not clear how much it helped in alleviating the paraphrase problem. An analysis of individual queries could help answer this question.
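As a rough illustration of synonym-based query expansion, the sketch below appends the listed synonyms of each query term to the query. The synonym table is a hypothetical stand-in for WordNet, and adding every listed synonym is an assumption that may not match how the WN-ALL queries were actually built.

```python
from typing import Dict, List

# Hypothetical stand-in for WordNet lookups; a real run would query WordNet.
TOY_SYNONYMS: Dict[str, List[str]] = {
    "child": ["kid", "youngster"],
    "health": ["wellness"],
}


def expand_query(terms: List[str], synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the query terms followed by their synonyms, without duplicates."""
    expanded: List[str] = []
    for term in terms:
        # Keep the original term first, then any synonyms listed for it.
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["child", "health", "policy"], TOY_SYNONYMS))
```

Terms without an entry in the table, such as "policy" here, are passed through unchanged.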

Another question that remains open is whether the improvement over the dictionary method is actually due to the use of the MT system. SYSTRAN's lexicon is impressive in terms of coverage, and the better results might simply be a consequence of that coverage. One way to verify this would be to derive a bilingual dictionary from the lexicon used by SYSTRAN and see how well the dictionary method performs with it. However, the results of the tests with GVSM and LSI suggest that there might be benefits to be gained from using the MT-translated corpus over a dictionary.
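For orientation, a dictionary method of this kind can be pictured as term-by-term query translation through a bilingual word list. The sketch below is an assumption about the general technique, not the DICT implementation used in the experiments. The tiny English-Spanish dictionary is hypothetical, and keeping untranslatable terms unchanged is a design choice made only for this example.

```python
from typing import Dict, List

# Hypothetical toy dictionary; coverage of a real lexicon would be far larger.
TOY_EN_ES: Dict[str, List[str]] = {
    "child": ["niño", "niña"],
    "health": ["salud"],
}


def translate_query(terms: List[str], dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace each term with all of its dictionary translations."""
    translated: List[str] = []
    for term in terms:
        # Terms missing from the dictionary are kept as-is (an assumption).
        translated.extend(dictionary.get(term, [term]))
    return translated


print(translate_query(["child", "health", "unicef"], TOY_EN_ES))
```

The sketch makes the coverage question concrete: every term missing from the dictionary reaches the retrieval system untranslated, so a lexicon with broader coverage could by itself change the outcome.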

As for further work, it would be interesting to see the results on the sentence-aligned corpus, and to see how the MT-based methods perform on a different collection. MEDLINE is available, but it might not be the right candidate because of the technical terminology it contains.

Running the dictionary method with a dictionary based on the SYSTRAN lexicon would also be a very instructive experiment, but it is unlikely that such a dictionary could be obtained.


last update: April 24th, 1998