Course Project for 11-741 (Information Retrieval)


Basic Information

Contents


Abstract

     Traditional ad hoc information retrieval methods cannot be applied effectively to translingual IR, because the query and the documents are in different languages and share few terms. Dual-space methods have been investigated to achieve good performance in this setting. In this project, I will implement two of these methods, GVSM (Generalized Vector Space Model) and LSI (Latent Semantic Indexing), in a more efficient and scalable way. The results of both monolingual and translingual retrieval will then be evaluated, and the performance of the two methods will be compared under different parameter settings. The main purpose of this project is to investigate the trade-off between time cost and retrieval performance in both methods, but it will also be valuable to compare the two dual-space methods with other methods, such as PRF (Pseudo-Relevance Feedback) and dictionary-based (DICT) methods. 

Proposal and Timelines

1. Specific Aims

The objective of this project is to apply the GVSM and LSI methods to both monolingual and cross-lingual information retrieval. The specific aims and methods are:  
  1. Implement the GVSM and LSI methods in an efficient and scalable way, focusing on the complexity of singular value decomposition and matrix multiplication.
  2. Tune parameters, including the indexing weight schemes, the stop-word list, and the number of singular values used; analyze the performance (the trade-off between complexity and 11-pt average precision) of GVSM and LSI under different parameter settings.
  3. Test GVSM and LSI translingual IR (TIR) on the machine-translated UNICEF data (provided by Krzysztof); compare the results of both methods with those on the human-translated data.
  4. Compare the results of GVSM and LSI with other methods, such as SMART+WordNet, PRF, and DICT (the results of these methods will be provided by others).
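As a sketch of the dual-space idea behind aim 1: cross-lingual LSI computes a truncated SVD of a combined term-document matrix (English and Spanish term rows stacked, columns being parallel training documents) and folds queries and documents into the reduced space, where a query in one language can match a document in the other. The toy matrix, the value of k, and the fold_in helper below are all invented for illustration, not the project's actual data or code:

```python
import numpy as np

# Toy combined term-document matrix: English term rows stacked on
# Spanish term rows, columns = parallel training documents.
A = np.array([
    [2.0, 0.0, 1.0],   # English term "water"
    [0.0, 1.0, 1.0],   # English term "child"
    [2.0, 0.0, 1.0],   # Spanish term "agua"
    [0.0, 1.0, 1.0],   # Spanish term "nino"
])

# Truncated SVD: keep k singular values (the key LSI parameter).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk = U[:, :k], s[:k]

def fold_in(vec):
    """Project a term vector (query or document) into the k-dim LSI space."""
    return (vec @ Uk) / sk

# A Spanish query on "agua" matches an English document about "water",
# even though they share no terms in the original spaces.
q_es = np.array([0.0, 0.0, 1.0, 0.0])
d_en = np.array([1.0, 0.0, 0.0, 0.0])
q_lsi, d_lsi = fold_in(q_es), fold_in(d_en)
sim = float(q_lsi @ d_lsi / (np.linalg.norm(q_lsi) * np.linalg.norm(d_lsi)))
print(round(sim, 3))  # 1.0: the two terms co-occur identically in training
```

The cosine is high because "agua" and "water" have identical rows in the training matrix, so they land on the same point in the latent space.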

2. Timeline

 
  Mar. 5 - Mar. 11:
    Analyze the SVDC package source code and modify it to use much less
    memory (reduced by at least 50%). [Achieved in this period.]
  Mar. 12 - Mar. 18:
    Use SMART to index the English and Spanish documents separately; write
    programs to combine them and to sparsify the combined matrix into CCS
    format, which is required by the SVD package. [Achieved in this period.]
  Mar. 19 - Mar. 28:
    Write programs for matrix multiplication, especially with sparse
    matrices. [Achieved in this period.]
  Mar. 29 - Apr. 4:
    Write programs to implement the LSI procedure; get some preliminary
    results. [Achieved in this period.]
  Apr. 5 - Apr. 11:
    Write programs to implement the GVSM procedure; get some preliminary
    GVSM results. [Achieved in this period.]
  Apr. 12 - Apr. 18:
    Parameter tuning, result analysis, and evaluation; run LSI and GVSM on
    the Systran result (provided by Krzysztof). [Achieved in this period.]
  Apr. 19 - Apr. 25:
    Run GVSM and LSI for other groups; exchange results with Paul, Liren,
    and Krzysztof; do comparisons. [Achieved by Apr. 23.]
  Apr. 25 - Apr. 30:
    Get ready for the presentation on Apr. 28. [Expected.]
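The CCS format mentioned in the timeline is compressed column storage (also called CSC): the nonzero values listed column by column, the row index of each nonzero, and a pointer array marking where each column begins. A minimal sketch of such a conversion (the toy matrix and the function name to_ccs are invented; the project's own conversion programs are not reproduced here):

```python
# Toy term-document matrix with many zeros (rows = terms, cols = documents).
A = [
    [0.0, 2.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 4.0],
]

def to_ccs(matrix):
    """Convert a dense matrix to compressed column storage (CCS/CSC)."""
    values, row_idx, col_ptr = [], [], [0]
    n_rows, n_cols = len(matrix), len(matrix[0])
    for j in range(n_cols):          # scan column by column
        for i in range(n_rows):
            if matrix[i][j] != 0.0:
                values.append(matrix[i][j])
                row_idx.append(i)
        col_ptr.append(len(values))  # offset where the next column starts
    return values, row_idx, col_ptr

values, row_idx, col_ptr = to_ccs(A)
print(values)   # [1.0, 2.0, 3.0, 4.0]
print(row_idx)  # [1, 0, 2, 2]
print(col_ptr)  # [0, 1, 3, 4]
```

Only the nonzeros are stored, so for a sparse term-document matrix this layout cuts memory roughly in proportion to the density of the matrix.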
 

System Description

See Figure 1.1.

Experiments

  1. Compare GVSM results obtained with different sparsification factors.
  2. Compare LSI results obtained with different numbers of singular values.
  3. Compare the results of both LSI and GVSM with different training sets (alignment at the document, paragraph, or sentence level).
  4. Do timing analyses of the SVD procedure with different numbers of singular values and iteration steps.
  5. Do timing analyses of GVSM and LSI with different sparsification factors or numbers of singular values.
  6. Compare the results on the Systran data with the previous ones.
  7. Compare the GVSM and LSI results with SMART+WordNet, PRF, and DICT.
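For context on the sparsification factor varied in experiments 1 and 5: in GVSM, each query or document can be represented by its inner products with the parallel training documents, and sparsifying that representation keeps only its largest coordinates. A hedged sketch under those assumptions (toy data; gvsm_vector and its parameter names are invented, not the project's code):

```python
import numpy as np

def gvsm_vector(x, T, sparsification=2):
    """Map a term vector x into training-document space: one coordinate
    per training document (column of T).  Keeping only the
    'sparsification' largest coordinates mimics the sparsification
    factor varied in the experiments."""
    v = T.T @ x
    if sparsification < len(v):
        cutoff = np.sort(np.abs(v))[-sparsification]
        v = np.where(np.abs(v) >= cutoff, v, 0.0)
    return v

# Toy parallel training matrix: English term rows stacked on Spanish ones.
T = np.array([
    [2.0, 0.0],   # en: water
    [0.0, 1.0],   # en: child
    [2.0, 0.0],   # es: agua
    [0.0, 1.0],   # es: nino
])
q_es = np.array([0.0, 0.0, 1.0, 0.0])   # Spanish query "agua"
d_en = np.array([1.0, 0.0, 0.0, 0.0])   # English document "water"
vq, vd = gvsm_vector(q_es, T), gvsm_vector(d_en, T)
sim = float(vq @ vd / (np.linalg.norm(vq) * np.linalg.norm(vd)))
print(round(sim, 3))  # 1.0: matched through the shared training documents
```

A larger sparsification factor keeps more training-document coordinates per vector, which raises both the memory cost and, up to a point, the retrieval quality, as the first results table below illustrates.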

Results

  Results of GVSM with different sparsification factors
  (Doc_Weight = ntc, Qry_Weight = ntc, TrainingSet = DOC):

    Sparsification   50      80      100     150     200     500     600     1134
    MIR 11-avgp      0.3941  0.4022  0.4026  0.3935  0.3845  0.3791  0.3770  0.3860
    TIR 11-avgp      0.3897  0.3663  0.3862  0.3713  0.3644  0.3674  0.3701  0.3751
     
     
  Results of LSI with different numbers of singular values
  (Doc_Weight = ntc, Qry_Weight = ntc, TrainingSet = DOC):

    #Sing. Values    50      80      100     200     300     500     600     1134
    MIR 11-avgp      0.3615  0.3927  0.3954  0.4275  0.4267  0.4357  0.4417  0.4576
    TIR 11-avgp      0.3626  0.3839  0.3967  0.4114  0.4145  0.4171  0.4233  0.4292
     
  Results of GVSM with different training sets:

    Training Set  Sparsification  MIR 11-avgp  TIR 11-avgp
    DOC           100             0.4026       0.3862
    DOC           200             0.3845       0.3644
    DOC           600             0.3770       0.3701
    PARAGRAPH     200             0.4097       0.4013
    PARAGRAPH     600             0.3968       0.3848
    PARAGRAPH     1134            0.3955       0.3811
    SENTENCE      200             0.3885       0.3883
    SENTENCE      600             0.3796       0.3731
    SENTENCE      1000            0.3722       0.3645
     
  Results of LSI with different training sets:

    Training Set  #Sing. Values  MIR 11-avgp  TIR 11-avgp
    DOC           100            0.3950       0.3967
    DOC           200            0.4275       0.4114
    DOC           600            0.4417       0.4233
    PARAGRAPH     200            0.3931       0.3835
    PARAGRAPH     600            0.4367       0.4038
    PARAGRAPH     1134           0.4345       0.4035
    SENTENCE      200            0.3166       0.3022
    SENTENCE      600            0.3946       0.3949
    SENTENCE      1000           0.4244       0.4257
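The 11-avgp figures in these tables are 11-point interpolated average precision: precision interpolated at the recall levels 0.0, 0.1, ..., 1.0 and averaged. An illustrative computation for a single made-up query (the ranking and relevance judgments below are invented, not data from the project):

```python
def avgp_11pt(ranked, relevant):
    """11-point interpolated average precision for one query."""
    relevant = set(relevant)
    hits, prec_at, recall_at = 0, [], []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        prec_at.append(hits / i)              # precision after rank i
        recall_at.append(hits / len(relevant))
    interp = []
    for level in (i / 10 for i in range(11)):
        # interpolated precision: max precision at any recall >= level
        ps = [p for p, r in zip(prec_at, recall_at) if r >= level]
        interp.append(max(ps) if ps else 0.0)
    return sum(interp) / 11

ranked = ["d3", "d1", "d7", "d2", "d9"]   # system ranking for one query
score = avgp_11pt(ranked, {"d3", "d2"})   # d3 and d2 are the relevant docs
print(round(score, 4))  # 0.7727
```

Averaging this score over all evaluation queries gives the single 11-avgp number reported per run in the tables above.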
     

Demo

    No demo.


Last update: Apr 21, 1998

Comments to: xliu@cs.cmu.edu