The topic of the project is Translingual Information Retrieval using different retrieval methods and models. In the first part of the project, the following methods will be applied to the English-Spanish UNICEF collection: DICT, DICT+WordNet, GVSM, PRF, and LSI. In the second part of the project, an automatically generated corpus will be used for the methods requiring training on bilingual corpora. The corpus will be obtained from the translation service provided on the Web by AltaVista, which is based on the SYSTRAN MT system.
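For reference, a minimal sketch of how the translation step could be automated. The endpoint URL and form-field names below are hypothetical placeholders, not the actual AltaVista interface:

```python
# Hedged sketch of automating the corpus translation step. The endpoint
# URL and form-field names ("text", "langpair") are hypothetical
# placeholders, not the actual AltaVista/SYSTRAN interface.
import time
import urllib.parse
import urllib.request

TRANSLATE_URL = "http://translate.example.com/translate"  # placeholder

def translate(text, langpair="es_en"):
    """Send one chunk of text to the web translation service and return
    the translated string (assumes a plain-text response)."""
    params = urllib.parse.urlencode({"text": text, "langpair": langpair})
    with urllib.request.urlopen(TRANSLATE_URL + "?" + params) as resp:
        return resp.read().decode("utf-8")

def translate_corpus(paragraphs):
    """Translate a list of paragraphs one at a time, pausing between
    requests so as not to overload the service."""
    translated = []
    for p in paragraphs:
        translated.append(translate(p))
        time.sleep(1)  # be polite to the remote service
    return translated
```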
Task | To be done by | Status
Set up the web page | March 3rd | done
Investigate the Web translation engine | March 6th | done
Code for obtaining translations from the Web | March 10th | done
Translation of the UNICEF corpus | March 20th | done
Investigate and run DICT on UNICEF | March 17th | done
Investigate and run GVSM on UNICEF | March 17th | done
Investigate and run LSI on UNICEF | March 22nd | done
Investigate and run PRF on UNICEF | March 25th | done
Corpus clean-up | March 30th | done
Implement query expansion using WordNet | April 5th | done
Rerun the experiments with PRF, LSI, GVSM on the generated corpus | April 15th | done
Write-up, presentation | April 24th | done
Queries
WN-1: Queries expanded using the first WordNet sense
WN-ALL: Queries expanded using all WordNet senses
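As an illustration, a minimal sketch of the two expansion strategies, using NLTK's WordNet interface as a stand-in for whatever WordNet API the project actually used:

```python
# Sketch of the WN-1 and WN-ALL expansion strategies. NLTK's WordNet
# interface is used here only for illustration (requires the "wordnet"
# data package).
from nltk.corpus import wordnet as wn

def _synonyms(synsets, term):
    """Collect lemma names from the given synsets, excluding the term."""
    return [l.replace("_", " ")
            for s in synsets for l in s.lemma_names()
            if l.lower() != term.lower()]

def expand_wn_1(query_terms):
    """WN-1: expand each term with synonyms from its first sense only."""
    out = list(query_terms)
    for t in query_terms:
        out += _synonyms(wn.synsets(t)[:1], t)
    return out

def expand_wn_all(query_terms):
    """WN-ALL: expand each term with synonyms from all of its senses."""
    out = list(query_terms)
    for t in query_terms:
        out += _synonyms(wn.synsets(t), t)
    return out
```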
DICT
- Best configuration: ntc-ntc, no WordNet query expansion
[Figure: Precision/recall curves and 11-pt average precision for DICT]
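For reference, ntc in SMART notation denotes natural term frequency (n), idf weighting (t), and cosine normalization (c); ntc-ntc applies the same scheme to both documents and queries. A minimal sketch, assuming the standard SMART definitions:

```python
# Sketch of SMART "ntc" weighting: natural tf (n), idf (t), cosine
# normalization (c). "ntc-ntc" applies it to documents and queries alike.
import math

def ntc_weights(tf, df, n_docs):
    """tf: {term: raw frequency} for one document or query;
    df: {term: document frequency over the collection};
    n_docs: total number of documents in the collection."""
    w = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}  # tf * idf
    norm = math.sqrt(sum(v * v for v in w.values()))              # cosine
    return {t: v / norm for t, v in w.items()} if norm else w
```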
PRF
Best configuration with WordNet
- Human: WN-ALL, paragraphs, K = 5, E = 80, ntc-ntc
- MT: WN-ALL, paragraphs, K = 5, E = 90, ntc-ntc
Best configuration without WordNet
- Human: paragraphs, K = 5, E = 80, ntc-ntc
- MT: paragraphs, K = 5, E = 90, ntc-ntc
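A minimal sketch of the feedback loop, under the assumption that K is the number of top-ranked documents treated as relevant and E is the number of expansion terms added (the parameter roles are inferred from the configurations above):

```python
# Hedged sketch of pseudo-relevance feedback. The roles of K and E
# (K top-ranked documents treated as relevant, E expansion terms added)
# are assumptions inferred from the configurations above.
from collections import Counter

def prf_expand(query_terms, ranked_doc_terms, K=5, E=80):
    """ranked_doc_terms: list of term lists from the initial retrieval
    run, best document first. Returns the expanded query."""
    counts = Counter()
    for doc in ranked_doc_terms[:K]:  # pseudo-relevant top-K documents
        counts.update(doc)
    expansion = [t for t, _ in counts.most_common() if t not in query_terms]
    return list(query_terms) + expansion[:E]
```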
GVSM (run by Xin Liu)
Best configuration with WordNet
- Human: paragraphs, SP = 300
- MT: WN-ALL, documents, SP = 100
Best configuration without WordNet
- Human: paragraphs, SP = 100
- MT: paragraphs, SP = 200
[Figure: GVSM, 11-pt average precision as a function of sparsification, documents]
[Figure: GVSM, 11-pt average precision as a function of sparsification, paragraphs]
[Table: Best performing configurations for GVSM]
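A minimal sketch of the cross-lingual GVSM matching step, assuming the standard formulation in which queries and documents are mapped into the space spanned by a parallel training corpus, and assuming SP controls sparsification by keeping only the SP largest components of each mapped vector:

```python
# Hedged sketch of cross-lingual GVSM. An English query q is represented
# as E^T q and a Spanish document d as S^T d, where E and S are the
# term-by-document matrices of the aligned English and Spanish sides of
# the training corpus. The interpretation of SP as the number of
# components kept after sparsification is an assumption.
import numpy as np

def sparsify(v, sp):
    """Zero out all but the sp largest-magnitude components of v."""
    if sp >= len(v):
        return v
    keep = np.argsort(np.abs(v))[-sp:]
    out = np.zeros_like(v)
    out[keep] = v[keep]
    return out

def gvsm_score(q_en, d_es, E, S, sp=300):
    """Cosine similarity of query and document after mapping both into
    the shared training-document space."""
    q_map = sparsify(E.T @ q_en, sp)
    d_map = sparsify(S.T @ d_es, sp)
    denom = np.linalg.norm(q_map) * np.linalg.norm(d_map)
    return float(q_map @ d_map) / denom if denom else 0.0
```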
LSI (run by Xin Liu)
Best configuration with WordNet
- Human: documents, SV = 200
- MT: WN-ALL, paragraphs, SV = 300
Best configuration without WordNet
- Human: documents, SV = 1000
- MT: paragraphs, SV = 300
[Figure: LSI, 11-pt average precision as a function of the number of singular values, documents]
[Figure: LSI, 11-pt average precision as a function of the number of singular values, paragraphs]
[Table: Best performing configurations for LSI]
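A minimal sketch of cross-lingual LSI, assuming the dual-language formulation in which each training document stacks its English and Spanish term vectors into one column; SV is the number of singular values retained:

```python
# Hedged sketch of cross-lingual LSI. The dual-language training matrix
# stacks English and Spanish term vectors of each aligned document into
# one column; a truncated SVD of this matrix gives a shared latent
# space. SV = number of singular values retained (assumption).
import numpy as np

def train_lsi(X_dual, sv=300):
    """X_dual: (en_terms + es_terms) x n_docs dual-language matrix.
    Returns the rank-sv term projection and its singular values."""
    U, s, _ = np.linalg.svd(X_dual, full_matrices=False)
    return U[:, :sv], s[:sv]

def fold_in(v, U_k, s_k):
    """Project a (zero-padded) query or document term vector into the
    latent space: v_k = diag(1/s) U^T v."""
    return (U_k.T @ v) / s_k

def lsi_score(q_latent, d_latent):
    """Cosine similarity in the latent space."""
    denom = np.linalg.norm(q_latent) * np.linalg.norm(d_latent)
    return float(q_latent @ d_latent) / denom if denom else 0.0
```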
11-pt average precision for the best configuration of each method:

 | DICT | PRF | GVSM | LSI
no WordNet, human | 0.2898 | 0.3757 | 0.4059 | 0.4333
no WordNet, translations | -- | 0.3671 | 0.3772 | 0.3999
with WordNet, human | 0.2909 | 0.4027 | 0.3894 | 0.4220
with WordNet, translations | -- | 0.3786 | 0.3822 | 0.4162
It is also interesting that the WordNet query expansion strategy produced such good results; WN-ALL is the best-performing set of queries. However, since expansion helped with both the human-translated and the machine-translated corpus, it is not clear how much it contributed to alleviating the paraphrase problem. An analysis of individual queries could help answer this question.
Another question that remains open is whether the improvement over the dictionary method is actually due to the use of the MT system. SYSTRAN's lexicon is impressive in terms of coverage, and the better results might simply be a consequence of that coverage. One way to verify this would be to derive a bilingual dictionary from the lexicon used by SYSTRAN and see how well the dictionary method performs with it. However, the results of the tests with GVSM and LSI suggest that there may be benefits to using the MT-translated corpus beyond what a dictionary provides.
As for future work, it would be interesting to see the results on the sentence-aligned corpus. It would also be very interesting to see how the MT-based methods perform on a different collection. MEDLINE is available, but it might not be the right candidate because of the technical terminology it contains.
Also, running the dictionary method with a dictionary derived from the SYSTRAN lexicon would be a very instructive experiment, but it is unlikely that such a dictionary could be obtained.