Traditional information retrieval methods, such as ad hoc retrieval, cannot be applied effectively to translingual IR, because the query and the documents are in different languages and share few terms. Several dual-space methods have been investigated that achieve good performance. In this project, I am going to implement two of these methods, GVSM (Generalized Vector Space Model) and LSI (Latent Semantic Indexing), in a more efficient and scalable way. The results of both monolingual and translingual retrieval will then be evaluated, and the performance of the two methods under different parameter settings will be compared. The main purpose of this project is to investigate the trade-off between time cost and retrieval performance in the two methods, but it will also be valuable to compare these dual-space methods with other approaches, such as PRF (Pseudo-Relevance Feedback) and dictionary-based (DICT) methods.
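As a rough sketch of the dual-space idea (notation mine, not taken from the project write-up): both methods score a query $q$ against a document $d$ through a parallel dual-language training collection with term-by-document matrix $C$, rather than through direct term overlap,

\[
\mathrm{sim}_{\mathrm{GVSM}}(q, d) = \cos\!\big(C^{\top} q,\; C^{\top} d\big),
\qquad
\mathrm{sim}_{\mathrm{LSI}}(q, d) = \cos\!\big(\Sigma_k^{-1} U_k^{\top} q,\; \Sigma_k^{-1} U_k^{\top} d\big),
\]

where $C \approx U_k \Sigma_k V_k^{\top}$ is the rank-$k$ truncated SVD of $C$. GVSM compares the two vectors of similarities to the training documents, while LSI compares their projections into the $k$-dimensional latent space.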
1. Specific Aims

The objective of this project is to apply the GVSM and LSI methods to both monolingual and cross-lingual information retrieval. The specific aims, methods, and schedule are:
Mar. 5 - Mar. 11: Analyze the SVDC package source code and modify it to use much less memory (reduced by at least 50%). (Achieved in this period of time.)
Mar. 12 - Mar. 18: Use SMART to index the English and Spanish documents separately; write programs to combine them and to sparsify the combined matrix in CCS format, which is required by SVD. (Achieved in this period of time.)
Mar. 19 - Mar. 28: Write programs to do matrix multiplication, especially for sparse matrices. (Achieved in this period of time.)
Mar. 29 - Apr. 4: Write programs to implement the LSI procedure and get some preliminary results. (Achieved in this period of time.)
Apr. 5 - Apr. 11: Write programs to implement the GVSM procedure and get some preliminary GVSM results (see the code sketch after this schedule). (Achieved in this period of time.)
Apr. 12 - Apr. 18: Parameter tuning, result analysis, and evaluation; run LSI and GVSM on the Systran results (provided by Krzysztof). (Achieved in this period of time.)
Apr. 19 - Apr. 25: Run GVSM and LSI for the other groups; exchange results with Paul, Liren, and Krzysztof, and do comparisons. (Achieved by Apr. 23.)
Apr. 25 - Apr. 30: Get ready for the presentation on Apr. 28. (Expected.)
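Since the schedule only names the LSI and GVSM steps, the following is a minimal, self-contained sketch of what they amount to, written in Python with NumPy/SciPy as a stand-in for the project's actual C and SMART pipeline; the toy matrix, the variable names, and the random data are assumptions for illustration only. The training matrix is held in compressed-column (CCS, SciPy's CSC) form, LSI folds query and document vectors into a truncated-SVD space, and GVSM represents them by their similarities to the training documents.

    import numpy as np
    from scipy.sparse import csc_matrix            # CCS = compressed column storage (CSC in SciPy)
    from scipy.sparse.linalg import svds

    # Toy stand-in for the combined English+Spanish term-by-document training matrix.
    # In the real pipeline this comes from indexing both halves with SMART and stacking them.
    rng = np.random.default_rng(0)
    C = csc_matrix((rng.random((12, 8)) > 0.5).astype(float))   # sparse, stored column-wise

    # LSI: rank-k truncated SVD of the training matrix, C ~= U @ diag(s) @ Vt.
    k = 4
    U, s, Vt = svds(C, k=k)
    fold_in = lambda v: (v @ U) / s                # fold a term vector into the k-dim LSI space

    # GVSM: represent a term vector by its similarity to every training document (C^T v).
    gvsm = lambda v: C.T @ v

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    # A query and a document over the same combined vocabulary (monolingual or translingual).
    q = rng.random(12)
    d = rng.random(12)

    print("LSI  similarity:", round(cosine(fold_in(q), fold_in(d)), 3))
    print("GVSM similarity:", round(cosine(gvsm(q), gvsm(d)), 3))

In the actual runs the vectors presumably carry the ntc weighting reported in the tables below; the SVD is computed once over the training set, so per query only the fold-in and the sparse matrix-vector products are needed.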
See figure 1.1.
(Doc_Weight = ntc, Qry_Weight = ntc, TrainingSet = DOC)

Sparsification | 50     | 80     | 100    | 150    | 200    | 500    | 600    | 1134
MIR 11-avgp.   | 0.3941 | 0.4022 | 0.4026 | 0.3935 | 0.3845 | 0.3791 | 0.3770 | 0.3860
TIR 11-avgp.   | 0.3897 | 0.3663 | 0.3862 | 0.3713 | 0.3644 | 0.3674 | 0.3701 | 0.3751

(Doc_Weight = ntc, Qry_Weight = ntc, TrainingSet = DOC)

#Sig Values    | 50     | 80     | 100    | 200    | 300    | 500    | 600    | 1134
MIR 11-avgp.   | 0.3615 | 0.3927 | 0.3954 | 0.4275 | 0.4267 | 0.4357 | 0.4417 | 0.4576
TIR 11-avgp.   | 0.3626 | 0.3839 | 0.3967 | 0.4114 | 0.4145 | 0.4171 | 0.4233 | 0.4292

Training Set | Sparsification | 11-avgp of MIR | 11-avgp of TIR
DOC          | 100            | 0.4026         | 0.3862
DOC          | 200            | 0.3845         | 0.3644
DOC          | 600            | 0.3770         | 0.3701
PARAGRAPH    | 200            | 0.4097         | 0.4013
PARAGRAPH    | 600            | 0.3968         | 0.3848
PARAGRAPH    | 1134           | 0.3955         | 0.3811
SENTENCE     | 200            | 0.3885         | 0.3883
SENTENCE     | 600            | 0.3796         | 0.3731
SENTENCE     | 1000           | 0.3722         | 0.3645

Training Set | Sparsification | 11-avgp of MIR | 11-avgp of TIR
DOC          | 100            | 0.395          | 0.3967
DOC          | 200            | 0.4275         | 0.4114
DOC          | 600            | 0.4417         | 0.4233
PARAGRAPH    | 200            | 0.3931         | 0.3835
PARAGRAPH    | 600            | 0.4367         | 0.4038
PARAGRAPH    | 1134           | 0.4345         | 0.4035
SENTENCE     | 200            | 0.3166         | 0.3022
SENTENCE     | 600            | 0.3946         | 0.3949
SENTENCE     | 1000           | 0.4244         | 0.4257
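For reference (my gloss, not spelled out above): MIR and TIR presumably denote the monolingual and translingual IR runs, and 11-avgp is the standard 11-point interpolated average precision, i.e. the precision interpolated at the recall levels 0.0, 0.1, ..., 1.0 and then averaged. A minimal sketch of that computation for a single query, with made-up ranking and relevance data:

    import numpy as np

    def eleven_point_avg_precision(ranked_ids, relevant_ids):
        """11-point interpolated average precision for a single query."""
        relevant_ids = set(relevant_ids)
        hits, recalls, precisions = 0, [], []
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                hits += 1
                recalls.append(hits / len(relevant_ids))
                precisions.append(hits / rank)
        # Interpolated precision at recall r = max precision at any recall >= r.
        interpolated = []
        for r in np.linspace(0.0, 1.0, 11):
            candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
            interpolated.append(max(candidates) if candidates else 0.0)
        return float(np.mean(interpolated))

    # Hypothetical run: relevant docs 1 and 4 retrieved at ranks 1 and 3.
    print(eleven_point_avg_precision([1, 2, 4, 5], relevant_ids={1, 4}))   # about 0.85

In TREC-style evaluation this per-query value is then averaged over all queries in the set.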
No demo.