IR term project of Liren Chen

Basic Information

Contents


Abstract

There are many monolingual text retrieval models (VSM, GVSM, LSI, etc.) and many query expansion techniques (pseudo relevance feedback, relevance feedback, dictionary-based expansion, word co-occurrence, etc.). In this project, we first explored query expansion based on ordered word co-occurrences. We then proposed a new retrieval model -- the Context Model -- based on word similarity, and ran experiments on it. The results show that this model works well for short queries, because the context of the words is considered in this model. After that, we designed a large set of experiments applying the different query expansion techniques to the different monolingual text retrieval models. The results show that relevance feedback (RF) improves the performance of almost all the models, while the benefit of the other expansion techniques depends on the retrieval model, the amount of expansion, and the collection (or domain).

Proposal and Timelines

Click here to see the proposal

Timelines Table

Task | To be done by | Status
Programming for stemming, deleting stopwords, calculating term frequencies | Mar. 3 | Done
Programming for extracting the ordered word co-occurrences | Mar. 10 | Done
Design and implement the query expansion algorithm based on the SMART system | Mar. 17 | Done; also finished PRF (programmed by myself, because the class directory did not have a PRF program when I ran the experiments)
Design and implement the query expansion algorithm based on WordNet and the SMART system | Mar. 24 | Finished prototype demo system
Work on PRF, GVSM and LSI | Apr. 7 | Finished relevance feedback method and query expansion using WordNet
Design and implement the Web interface for query expansion to some search engines | Apr. 14 | Generated relevance feedback queries for tests on GVSM and LSI (LSI could not run on my machine; thanks to Xin Liu for running these experiments)
Write final report and prepare for presentation & demo | Apr. 21 | Done

System Description

1. Ordered word co-occurrences and a new retrieval model -- the Context Model

a) Method for calculating the ordered word co-occurrences. Most previous work on word co-occurrences is based on documents, that is, a word is represented as a set of documents; or it is based on co-occurrences in a fixed-size window with no order considered, that is, (s, t) is treated the same as (t, s). Since the order of the words contributes to the meaning of a sentence, we use ordered word co-occurrences instead of unordered ones. Mutual information (MI), computed with word order taken into account, is used as the correlation measure to extract the ordered word co-occurrences from a large corpus. The mutual information is given by the following formula:

MI(s, t) = log2( P(s, t) / ( P(s) * P(t) ) )

where P(s, t) is the probability that s occurs before t within the window. Before the ordered word co-occurrences are calculated, the documents are preprocessed: stopwords are deleted and the remaining words are stemmed. The window size we used is 1000 words (500 words before and 500 words after t). The ordered word co-occurrences are domain dependent; some examples are listed in appendices A, B, and C. All the co-occurrences are then sorted in decreasing order of mutual information and put into the ordered word co-occurrences dictionary.
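The extraction step above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function name, the small window, and the count threshold are assumptions for the sake of a runnable example.

```python
import math
from collections import Counter

def ordered_cooccurrences(tokens, window=500, min_count=2):
    """Extract ordered pairs (s, t), where s occurs within `window`
    tokens before t, scored by mutual information and sorted in
    decreasing order of MI, as in the dictionary construction above."""
    unigram = Counter(tokens)
    pair = Counter()
    n = len(tokens)
    for i, t in enumerate(tokens):
        # only words appearing BEFORE t count toward the pair (s, t),
        # so (s, t) and (t, s) are distinct
        for s in tokens[max(0, i - window):i]:
            pair[(s, t)] += 1
    total = sum(pair.values())
    scored = {}
    for (s, t), c in pair.items():
        if c < min_count:
            continue
        p_st = c / total
        p_s = unigram[s] / n
        p_t = unigram[t] / n
        scored[(s, t)] = math.log(p_st / (p_s * p_t), 2)
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```

On real corpora the preprocessing (stopword removal, stemming) would run first, and the window would be the 1000-word setting described above.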

b) Query expansion algorithm based on the ordered word co-occurrences. Assume the keywords are K1, K2, ..., Km, and the expected number of expansion words is N, which can be set in the experiment. Let S1 = {s | (s, K1) is an ordered word co-occurrence in the dictionary}, ..., Sm = {s | (s, Km) is an ordered word co-occurrence in the dictionary}. Find the intersection of S1, S2, ..., Si (i <= m). If the intersection is not empty, use its words as the expansion words; otherwise decrease i and continue finding intersections until the number of expansion words equals N. This algorithm is used for the different retrieval models.
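The relaxation loop above can be sketched like this. The data layout (a dict mapping each keyword K to its set S_K) and the tie-breaking by sorted order are assumptions for illustration.

```python
def expand_query(keywords, cooc_sets, n_expand=5):
    """Intersection-based expansion: start with the intersection over
    all m keyword sets and relax (drop keywords from the end) until
    n_expand expansion words are collected."""
    expansion = []
    i = len(keywords)
    while i > 0 and len(expansion) < n_expand:
        sets = [cooc_sets.get(k, set()) for k in keywords[:i]]
        common = set.intersection(*sets)
        for w in sorted(common):
            if w not in expansion and w not in keywords:
                expansion.append(w)
        i -= 1
    return expansion[:n_expand]
```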

c) New retrieval model -- the Context Model -- based on word similarity. Mutual information can be used to measure the similarity between words, and our new retrieval model uses this similarity as the relevance measure. Suppose the document is d = W1 W2 ... Wn and the query is q = Q1 Q2 ... Qm. The motivation for this model is that the context of a word contributes some weight to the relevance between the query and the document.
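Since the exact scoring formula is not reproduced here, the following is only one plausible reading of a Context-Model-style score: each query word contributes its best MI-based similarity to any document word, with exact matches counting as similarity 1.0. Everything in this sketch (name, scoring rule, data layout) is an assumption.

```python
def context_score(doc_words, query_words, sim):
    """Score a document d = W1..Wn against a query q = Q1..Qm using a
    word-similarity table `sim` mapping (document word, query word)
    pairs to a similarity value such as mutual information."""
    score = 0.0
    for q in query_words:
        best = max((1.0 if w == q else sim.get((w, q), 0.0))
                   for w in doc_words)
        score += best
    return score
```

Under a scheme like this, a short query can still match a document that uses related context words rather than the query terms themselves, which is consistent with the observation above that the model helps most on short queries.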

2. Relevance feedback. From the relevance judgement file, randomly pick a pre-defined number (which can be set in the program) of relevant documents for every initial query. A new queries file is then created and tested on the different retrieval models.
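The query-generation step can be sketched as below. The dict-based layout of the queries, judgements, and documents is an assumption; the project's actual files use the SMART formats.

```python
import random

def relevance_feedback_queries(queries, judgements, docs, n_docs=5, seed=0):
    """For each initial query, randomly pick up to n_docs documents
    judged relevant and append their text to the query, producing the
    new queries to be run on each retrieval model."""
    rng = random.Random(seed)
    new_queries = {}
    for qid, text in queries.items():
        relevant = judgements.get(qid, [])
        picked = rng.sample(relevant, min(n_docs, len(relevant)))
        extra = " ".join(docs[d] for d in picked)
        new_queries[qid] = (text + " " + extra).strip()
    return new_queries
```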

3. Pseudo relevance feedback. Run the initial queries on the different retrieval models, extract the top N ranked documents, expand the initial queries with them, and finally run the different retrieval models again.
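A minimal sketch of the PRF loop: take the top N documents from the initial run, pick the most frequent non-query terms, and append them before re-running the model. Using raw term frequency as the selection criterion is an illustrative assumption, not necessarily the project's weighting.

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, ranked_docs, top_n=5, n_expand=3):
    """Expand a query with the most frequent terms from the top_n
    ranked documents (each document given as a list of tokens)."""
    counts = Counter()
    for doc in ranked_docs[:top_n]:
        counts.update(doc)
    # do not re-add terms already in the query
    for t in query_terms:
        counts.pop(t, None)
    expansion = [t for t, _ in counts.most_common(n_expand)]
    return query_terms + expansion
```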

4. WordNet based query expansion. For each initial query, first delete the stopwords, then look up every remaining word in the WordNet dictionary as a noun, verb, and adjective/adverb, and finally form the new query by combining the initial query with all the synonyms. This could be improved by disambiguating the POS and sense first.
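The combination step can be sketched as follows. The tiny SYNONYMS table here is a hypothetical stand-in for the real WordNet lookup, and the stopword list is truncated for illustration.

```python
STOPWORDS = {'the', 'of', 'a', 'an', 'in'}

# Hypothetical stand-in for the WordNet synonym lookup; in the project
# the real WordNet dictionary is consulted for Noun, Verb, and Adj/Adv.
SYNONYMS = {
    'heart': {'cardiac', 'pump'},
    'illness': {'disease', 'sickness'},
}

def wordnet_expand(query):
    """Drop stopwords, look up each remaining word's synonyms, and
    combine the original query with all synonyms found. No POS or
    sense disambiguation is done, as noted above."""
    words = [w for w in query.lower().split() if w not in STOPWORDS]
    expanded = list(words)
    for w in words:
        for s in sorted(SYNONYMS.get(w, ())):
            if s not in expanded:
                expanded.append(s)
    return expanded
```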

5. Interactive query expansion (IQE). We found that the ordered word co-occurrences reflect the semantics between words and the content of the document collection. That is, from the ordered word co-occurrences we learn something about what is mentioned in the document collection with respect to the current keywords. We therefore think these ordered word co-occurrences are better used for interactive query expansion: the user can choose which word or concept is related to his intended query. Accordingly, we developed a prototype IQE system based on the Wall Street Journal and Broadcast News corpora. It is seamlessly integrated with several popular search engines such as Yahoo, Excite, Lycos, and Altavista.

Experiments

1. Calculate the ordered word co-occurrences from Medlar, the Broadcast News corpus (140M tokens), and the Wall Street Journal corpus (1M tokens). The parameter in this experiment is the window size. We tried 10, 50, 100, 250, 500, 1000, and 1200, then evaluated the resulting co-occurrences; 1000 seems a reasonable balance between computation time and co-occurrence quality. There are 3 (corpora) * 7 (parameter values) = 21 experiments.

2. Test the new retrieval model -- the Context Model -- on Medlar. First we tested the average performance over all the queries; the results were not encouraging. Then we tested the performance on individual queries and compared with VSM (ltc). The results showed improvement on most of the short queries. There are 30 (queries) * 2 (approaches: VSM and the new method) = 60 experiments for the individual queries.

3. The ordered word co-occurrences as a query expansion method, using our algorithm, on VSM, GVSM and LSI. The number of words used to expand the query is adjustable; we tried 1 to 10, 15, and 20 words. The Medlar collection is used. For VSM, the ltc weighting scheme is used for documents and queries. For GVSM, the sparsification is 80. For LSI, the number of singular values is 300. There are 3 (models) * 12 (parameter values) = 36 experiments.

4. WordNet based query expansion on VSM, GVSM and LSI. All the synonyms are used to expand the query regardless of POS. (We could also test expansion using only nouns, only verbs, or only adjectives/adverbs.) The Medlar and Unicef collections are used. For VSM, the ltc weighting scheme is used for documents and queries. For GVSM, the sparsification is 80. For LSI, the number of singular values is 300. There are 3 (models) * 2 (collections) = 6 experiments.

5. Relevance feedback on VSM, GVSM and LSI. The parameter of these experiments is the number of relevant documents used as relevance feedback. We tested 1 to 10 and 15 documents, randomly picked from the relevance judgements. The Medlar and Unicef collections are used. For VSM, the ltc weighting scheme is used for documents and queries. For GVSM, the sparsification is 80. For LSI, the number of singular values is 300. There are 3 (models) * 11 (parameter values) * 2 (collections) = 66 experiments.

6. Pseudo relevance feedback on VSM, GVSM and LSI. The parameter of these experiments is the number of top-ranked documents used as pseudo relevance. We chose the top 1 to 10 and 15 documents. The Medlar and Unicef collections are used. For VSM, the ltc weighting scheme is used for documents and queries. For GVSM, the sparsification is 80. For LSI, the number of singular values is 300. There are 3 (models) * 11 (parameter values) * 2 (collections) = 66 experiments.

More than 250 experiments were designed and done in this project.

Results

1. The ordered word co-occurrences for Medlar.
2. The ordered word co-occurrences for the Wall Street Journal.
3. The ordered word co-occurrences for the Broadcast News corpus.
4. The average performance of the Context Model on Medlar.
5. The performance of the Context Model on individual queries on Medlar.
6. Ordered word co-occurrence query expansion on VSM for Medlar.
7. Ordered word co-occurrence query expansion on GVSM for Medlar.
8. Ordered word co-occurrence query expansion on LSI for Medlar.
9. WordNet based query expansion on VSM for Medlar and Unicef.
10. WordNet based query expansion on GVSM for Medlar and Unicef.
11. WordNet based query expansion on LSI for Medlar and Unicef.
12. Relevance feedback on VSM for Medlar and Unicef.
13. Relevance feedback on GVSM for Medlar and Unicef.
14. Relevance feedback on LSI for Medlar and Unicef.
15. Pseudo relevance feedback on VSM for Medlar and Unicef.
16. Pseudo relevance feedback on GVSM for Medlar and Unicef.
17. Pseudo relevance feedback on LSI for Medlar and Unicef.

Demo

Prototype Demo System

Final Report

Final report


last update: April 21, 1998