Course Project for 11-741 (Information Retrieval)


Basic Information

Contents


Abstract

     Traditional ad hoc information retrieval methods cannot be applied effectively to translingual IR, because the query and the documents are in different languages and share few terms. Dual-space methods have been investigated to achieve good performance in this setting. In this project, I will implement two of these methods, GVSM (Generalized Vector Space Model) and LSI (Latent Semantic Indexing), in a more efficient and scalable way. The results of both monolingual and translingual retrieval will then be evaluated, and the performance of the two methods will be compared under different parameter settings. The main purpose of this project is to investigate the trade-off between time cost and retrieval performance in both methods, but it will also be valuable to compare the two dual-space methods with other methods, such as PRF (Pseudo-Relevance Feedback) and dictionary-based (DICT) methods. 

Proposal and Timelines

1. Specific Aims

The objective of this project is to apply the GVSM and LSI methods to both monolingual and cross-lingual information retrieval. The specific aims and methods are:  
  1. Implement the GVSM and LSI methods in an efficient and scalable way, focusing on the complexity of singular value decomposition and matrix multiplication.
  2. Tune parameters, including the indexing weight schemes, the stop-word list, and the number of singular values used; analyze the performance (the trade-off between complexity and 11-pt average precision) of GVSM and LSI under different parameter settings.
  3. Test GVSM and LSI translingual IR (TIR) on the machine-translated UNICEF data (provided by Krzysztof); compare the results of both methods with those on the human-translated data.
  4. Compare the results of GVSM and LSI with other methods, such as SMART+WordNet, PRF, and DICT (the results of these methods will be provided by others).
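As a sketch of the dual-space idea behind aim 1: cross-lingual LSI computes a truncated SVD of a combined term-document matrix (English and Spanish term rows stacked, columns being parallel training documents) and folds queries and documents into the reduced space, where a query in one language can match a document in the other. The toy matrix, the value of k, and the fold_in helper below are all invented for illustration, not the project's actual data or code:

```python
import numpy as np

# Toy combined term-document matrix: English term rows stacked on
# Spanish term rows, columns = parallel training documents.
A = np.array([
    [2.0, 0.0, 1.0],   # English term "water"
    [0.0, 1.0, 1.0],   # English term "child"
    [2.0, 0.0, 1.0],   # Spanish term "agua"
    [0.0, 1.0, 1.0],   # Spanish term "nino"
])

# Truncated SVD: keep k singular values (the key LSI parameter).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk = U[:, :k], s[:k]

def fold_in(vec):
    """Project a term vector (query or document) into the k-dim LSI space."""
    return (vec @ Uk) / sk

# A Spanish query on "agua" matches an English document about "water",
# even though they share no terms in the original spaces.
q_es = np.array([0.0, 0.0, 1.0, 0.0])
d_en = np.array([1.0, 0.0, 0.0, 0.0])
q_lsi, d_lsi = fold_in(q_es), fold_in(d_en)
sim = float(q_lsi @ d_lsi / (np.linalg.norm(q_lsi) * np.linalg.norm(d_lsi)))
print(round(sim, 3))  # 1.0: the two terms co-occur identically in training
```

The cosine is high because "agua" and "water" have identical rows in the training matrix, so they land on the same point in the latent space.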

2. Timeline

 
  Mar. 5 - Mar. 11:
    Analyze the SVDC package source code and modify it to use much less
    memory (reduced by at least 50%). [Achieved in this period.]
  Mar. 12 - Mar. 18:
    Use SMART to index the English and Spanish documents separately; write
    programs to combine them and to sparsify the combined matrix into CCS
    format, which is required by the SVD package. [Achieved in this period.]
  Mar. 19 - Mar. 28:
    Write programs for matrix multiplication, especially with sparse
    matrices. [Achieved in this period.]
  Mar. 29 - Apr. 4:
    Write programs to implement the LSI procedure; get some preliminary
    results. [Achieved in this period.]
  Apr. 5 - Apr. 11:
    Write programs to implement the GVSM procedure; get some preliminary
    GVSM results. [Achieved in this period.]
  Apr. 12 - Apr. 18:
    Parameter tuning, result analysis, and evaluation; run LSI and GVSM on
    the Systran result (provided by Krzysztof). [Achieved in this period.]
  Apr. 19 - Apr. 25:
    Run GVSM and LSI for other groups; exchange results with Paul, Liren,
    and Krzysztof; do comparisons. [Achieved by Apr. 23.]
  Apr. 25 - Apr. 30:
    Get ready for the presentation on Apr. 28. [Expected.]
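The CCS format mentioned in the timeline is compressed column storage (also called CSC): the nonzero values listed column by column, the row index of each nonzero, and a pointer array marking where each column begins. A minimal sketch of such a conversion (the toy matrix and the function name to_ccs are invented; the project's own conversion programs are not reproduced here):

```python
# Toy term-document matrix with many zeros (rows = terms, cols = documents).
A = [
    [0.0, 2.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 4.0],
]

def to_ccs(matrix):
    """Convert a dense matrix to compressed column storage (CCS/CSC)."""
    values, row_idx, col_ptr = [], [], [0]
    n_rows, n_cols = len(matrix), len(matrix[0])
    for j in range(n_cols):          # scan column by column
        for i in range(n_rows):
            if matrix[i][j] != 0.0:
                values.append(matrix[i][j])
                row_idx.append(i)
        col_ptr.append(len(values))  # offset where the next column starts
    return values, row_idx, col_ptr

values, row_idx, col_ptr = to_ccs(A)
print(values)   # [1.0, 2.0, 3.0, 4.0]
print(row_idx)  # [1, 0, 2, 2]
print(col_ptr)  # [0, 1, 3, 4]
```

Only the nonzeros are stored, so for a sparse term-document matrix this layout cuts memory roughly in proportion to the density of the matrix.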
 

System Description

See Figure 1.1.

Experiments

  1. Compare GVSM results obtained with different sparsification factors.
  2. Compare LSI results obtained with different numbers of singular values.
  3. Compare the results of both LSI and GVSM with different training sets (alignment at the document, paragraph, or sentence level).
  4. Do timing analyses of the SVD procedure with different numbers of singular values and iteration steps.
  5. Do timing analyses of GVSM and LSI with different sparsification factors or numbers of singular values.
  6. Compare the results on the Systran data with the previous ones.
  7. Compare the GVSM and LSI results with SMART+WordNet, PRF, and DICT.
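For context on the sparsification factor varied in experiments 1 and 5: in GVSM, each query or document can be represented by its inner products with the parallel training documents, and sparsifying that representation keeps only its largest coordinates. A hedged sketch under those assumptions (toy data; gvsm_vector and its parameter names are invented, not the project's code):

```python
import numpy as np

def gvsm_vector(x, T, sparsification=2):
    """Map a term vector x into training-document space: one coordinate
    per training document (column of T).  Keeping only the
    'sparsification' largest coordinates mimics the sparsification
    factor varied in the experiments."""
    v = T.T @ x
    if sparsification < len(v):
        cutoff = np.sort(np.abs(v))[-sparsification]
        v = np.where(np.abs(v) >= cutoff, v, 0.0)
    return v

# Toy parallel training matrix: English term rows stacked on Spanish ones.
T = np.array([
    [2.0, 0.0],   # en: water
    [0.0, 1.0],   # en: child
    [2.0, 0.0],   # es: agua
    [0.0, 1.0],   # es: nino
])
q_es = np.array([0.0, 0.0, 1.0, 0.0])   # Spanish query "agua"
d_en = np.array([1.0, 0.0, 0.0, 0.0])   # English document "water"
vq, vd = gvsm_vector(q_es, T), gvsm_vector(d_en, T)
sim = float(vq @ vd / (np.linalg.norm(vq) * np.linalg.norm(vd)))
print(round(sim, 3))  # 1.0: matched through the shared training documents
```

A larger sparsification factor keeps more training-document coordinates per vector, which raises both the memory cost and, up to a point, the retrieval quality, as the first results table below illustrates.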

Results

  Results of GVSM with different sparsification factors
  (Doc_Weight = ntc, Qry_Weight = ntc, TrainingSet = DOC):

    Sparsification   50      80      100     150     200     500     600     1134
    MIR 11-avgp      0.3941  0.4022  0.4026  0.3935  0.3845  0.3791  0.3770  0.3860
    TIR 11-avgp      0.3897  0.3663  0.3862  0.3713  0.3644  0.3674  0.3701  0.3751
     
     
  Results of LSI with different numbers of singular values
  (Doc_Weight = ntc, Qry_Weight = ntc, TrainingSet = DOC):

    #Sing. Values    50      80      100     200     300     500     600     1134
    MIR 11-avgp      0.3615  0.3927  0.3954  0.4275  0.4267  0.4357  0.4417  0.4576
    TIR 11-avgp      0.3626  0.3839  0.3967  0.4114  0.4145  0.4171  0.4233  0.4292
     
  Results of GVSM with different training sets:

    Training Set  Sparsification  MIR 11-avgp  TIR 11-avgp
    DOC           100             0.4026       0.3862
    DOC           200             0.3845       0.3644
    DOC           600             0.3770       0.3701
    PARAGRAPH     200             0.4097       0.4013
    PARAGRAPH     600             0.3968       0.3848
    PARAGRAPH     1134            0.3955       0.3811
    SENTENCE      200             0.3885       0.3883
    SENTENCE      600             0.3796       0.3731
    SENTENCE      1000            0.3722       0.3645
     
  Results of LSI with different training sets:

    Training Set  #Sing. Values  MIR 11-avgp  TIR 11-avgp
    DOC           100            0.3950       0.3967
    DOC           200            0.4275       0.4114
    DOC           600            0.4417       0.4233
    PARAGRAPH     200            0.3931       0.3835
    PARAGRAPH     600            0.4367       0.4038
    PARAGRAPH     1134           0.4345       0.4035
    SENTENCE      200            0.3166       0.3022
    SENTENCE      600            0.3946       0.3949
    SENTENCE      1000           0.4244       0.4257
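The 11-avgp figures in these tables are 11-point interpolated average precision: precision interpolated at the recall levels 0.0, 0.1, ..., 1.0 and averaged. An illustrative computation for a single made-up query (the ranking and relevance judgments below are invented, not data from the project):

```python
def avgp_11pt(ranked, relevant):
    """11-point interpolated average precision for one query."""
    relevant = set(relevant)
    hits, prec_at, recall_at = 0, [], []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        prec_at.append(hits / i)              # precision after rank i
        recall_at.append(hits / len(relevant))
    interp = []
    for level in (i / 10 for i in range(11)):
        # interpolated precision: max precision at any recall >= level
        ps = [p for p, r in zip(prec_at, recall_at) if r >= level]
        interp.append(max(ps) if ps else 0.0)
    return sum(interp) / 11

ranked = ["d3", "d1", "d7", "d2", "d9"]   # system ranking for one query
score = avgp_11pt(ranked, {"d3", "d2"})   # d3 and d2 are the relevant docs
print(round(score, 4))  # 0.7727
```

Averaging this score over all evaluation queries gives the single 11-avgp number reported per run in the tables above.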
     

Demo

    No demo.


Last update: Apr 21, 1998

Comments to: xliu@cs.cmu.edu