Timeline Table

Task | To be done by | Status
Programming for stemming, deleting stopwords, and calculating term frequencies | Mar. 3 | Done
Programming for extracting ordered word co-occurrences | Mar. 10 | Done
Design and implement the query expansion algorithm based on the SMART system | Mar. 17 | Done. Also finished PRF (programmed it myself, because the class directory did not have a PRF program when I ran the experiments)
Design and implement the query expansion algorithm based on WordNet and the SMART system | Mar. 24 | Finished the prototype demo system
Work on PRF, GVSM, and LSI | Apr. 7 | Finished the relevance feedback method and query expansion using WordNet
Design and implement the Web interface for query expansion to several search engines | Apr. 14 | Generated relevance feedback queries for tests on GVSM and LSI. LSI could not run on my machine; thanks to Xin Liu for helping me run these experiments.
Write the final report and prepare the presentation & demo | Apr. 21 | Done
1. Ordered word co-occurrences and a new retrieval model -- the Context Model

a) Method for calculating the ordered word co-occurrences. Most previous work on word co-occurrences is either document-based, that is, a word is represented as a set of documents, or is based on co-occurrences within a fixed-size window with no order considered, so that (s, t) is treated the same as (t, s). Since the order of words contributes to the meaning of a sentence, we use ordered word co-occurrences instead of unordered ones. The mutual information (MI), computed with word order taken into account, measures the correlation between two words and is used to extract the ordered co-occurrences from a large corpus. It is given by

MI(s, t) = log [ P(s, t) / ( P(s) P(t) ) ],

where P(s, t) is the probability that s and t co-occur, with s before t, within the window, and P(s) and P(t) are the individual word probabilities. Before the ordered co-occurrences are calculated, the documents are preprocessed: stopwords are deleted and the remaining words are stemmed. The window size we used is 1000 words (500 words before and 500 words after t). The ordered word co-occurrences are domain dependent; some examples are listed in Appendices A, B, and C. All co-occurrences are then sorted in decreasing order of mutual information and put into the ordered word co-occurrence dictionary.
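A minimal sketch of the extraction step in a) above, assuming the corpus has already been preprocessed into a list of stemmed, stopword-free tokens; the exact probability estimates and the handling of the two-sided window in the original implementation may differ.

```python
import math
from collections import Counter

def ordered_cooccurrences(tokens, half_window=500):
    """Count ordered pairs (s, t) where s precedes t within the window,
    then score each pair by mutual information.

    `tokens` is assumed to be the preprocessed corpus: stopwords removed
    and the remaining words stemmed, as described above.
    """
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, s in enumerate(tokens):
        # only pairs where s comes before t, so (s, t) is not (t, s)
        for t in tokens[i + 1 : i + 1 + half_window]:
            pair_freq[(s, t)] += 1

    n = len(tokens)
    mi = {}
    for (s, t), c in pair_freq.items():
        # MI(s, t) = log P(s, t) / (P(s) P(t)), with window-based P(s, t)
        p_st = c / n
        p_s = word_freq[s] / n
        p_t = word_freq[t] / n
        mi[(s, t)] = math.log2(p_st / (p_s * p_t))

    # the dictionary is sorted by decreasing mutual information
    return sorted(mi.items(), key=lambda kv: kv[1], reverse=True)
```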
b) Query expansion algorithm based on the ordered word co-occurrences. Assume the keywords are K1, K2, ..., Km, and the expected number of expansion words is N, which can be set in the experiment. Let S1 = {s | (s, K1) is an ordered word co-occurrence in the dictionary}, ..., Sm = {s | (s, Km) is an ordered word co-occurrence in the dictionary}. Find the intersection of S1, ..., Si (i <= m). If the intersection is not empty, use the words in it as the expansion words; otherwise decrease i and keep taking intersections until the number of expansion words equals N. This algorithm is used with the different retrieval models.
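A sketch of this intersection procedure; `cooc_dict` is assumed to map each keyword Ki to its set Si, and the order in which keywords are dropped when the full intersection is empty is one plausible reading of "decrease i".

```python
def expand_query(keywords, cooc_dict, n_expansions):
    """Expand a query using the ordered co-occurrence dictionary.

    cooc_dict maps a keyword K to the set S = {s : (s, K) is in the
    ordered co-occurrence dictionary}.
    """
    expansions = []
    m = len(keywords)
    # start with the intersection over all m keywords, then relax by
    # dropping keywords from the end until enough expansion words are found
    for i in range(m, 0, -1):
        sets = [cooc_dict.get(k, set()) for k in keywords[:i]]
        common = set.intersection(*sets) if sets else set()
        for word in common:
            if word not in expansions:
                expansions.append(word)
            if len(expansions) >= n_expansions:
                return expansions
    return expansions
```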
c) New retrieval model -- the Context Model -- based on word similarity. Mutual information can be used to measure the similarity between words, and our new retrieval model uses this similarity as the relevance measure. Say the document is d = W1 W2 ... Wn and the query is q = Q1 Q2 ... Qm. The motivation for this model is that the context of a word contributes some weight to the relevance between the query and the document.
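The report does not spell out the scoring formula, so the sketch below is only one plausible instantiation: the relevance of d to q is taken as the accumulated MI-based similarity between every query word and every document word.

```python
def context_model_score(query_words, doc_words, mi):
    """Score a document by summing MI-based similarity between each
    query word and each document word (one possible reading of the
    Context Model; the exact weighting used in the report may differ)."""
    score = 0.0
    for q in query_words:
        for w in doc_words:
            # use the ordered-pair scores in both directions as the
            # word-to-word similarity
            score += mi.get((q, w), 0.0) + mi.get((w, q), 0.0)
    return score
```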
2. Relevance feedback. From the relevance judgement file, randomly pick a predefined number of documents (settable in the program) for every initial query. A new query file is then created and tested on the different retrieval models.
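A sketch of this query-construction step, assuming the relevance judgement file has been parsed into a mapping from query id to its relevant document ids; the number of feedback documents is the configurable parameter mentioned above.

```python
import random

def build_feedback_queries(initial_queries, rel_judgements, docs, n_docs):
    """For each initial query, randomly pick n_docs relevant documents
    and append their terms to form the feedback query."""
    feedback_queries = {}
    for qid, query_terms in initial_queries.items():
        relevant = rel_judgements.get(qid, [])
        picked = random.sample(relevant, min(n_docs, len(relevant)))
        extra_terms = [t for d in picked for t in docs[d]]
        feedback_queries[qid] = query_terms + extra_terms
    return feedback_queries
```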
3. Pseudo relevance feedback. Run the initial queries on the different retrieval models, extract the top N ranked documents, expand the initial queries with them, and finally run the retrieval models again.
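Pseudo relevance feedback differs from relevance feedback only in where the feedback documents come from; a sketch, assuming a `run_retrieval(query)` function that returns document ids ranked by the chosen model (VSM, GVSM, or LSI).

```python
def pseudo_relevance_feedback(query_terms, run_retrieval, docs, top_n):
    """Expand the query with terms from the top_n documents retrieved
    for the initial query, then re-run the expanded query on the
    same retrieval model."""
    ranked_doc_ids = run_retrieval(query_terms)[:top_n]
    extra_terms = [t for d in ranked_doc_ids for t in docs[d]]
    expanded = query_terms + extra_terms
    return run_retrieval(expanded)
```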
4. WordNet-based query expansion. For the initial query, first delete the stopwords, then look up every remaining word in the WordNet dictionary as a noun, verb, adjective, and adverb, and finally form the new query by combining the initial query with all of the synonyms. This could be improved by disambiguating the part of speech and the sense before expanding.
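A sketch of this lookup using the NLTK interface to WordNet (NLTK is not part of the original project, which consulted the WordNet dictionary directly); all synonyms are added regardless of part of speech, as in the experiments below.

```python
from nltk.corpus import wordnet as wn

def wordnet_expand(query_terms, stopwords):
    """Expand a query with all WordNet synonyms of its non-stopword
    terms, across noun, verb, adjective, and adverb senses."""
    expanded = list(query_terms)
    for term in query_terms:
        if term in stopwords:
            continue
        for pos in (wn.NOUN, wn.VERB, wn.ADJ, wn.ADV):
            for synset in wn.synsets(term, pos=pos):
                for lemma in synset.lemma_names():
                    word = lemma.replace('_', ' ')
                    if word not in expanded:
                        expanded.append(word)
    return expanded
```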
5. Interactive query expansion (IQE)
We found that the ordered word co-occurrences reflect the semantic relations between words and the content of the document collection. In other words, from the ordered co-occurrences we learn something about what the collection says regarding the current keywords, so they are well suited to interactive query expansion: the user can choose which word or concept is related to his intended query. Based on this, we developed a prototype IQE system built on the Wall Street Journal corpus and the Broadcast News corpus. It is seamlessly integrated with several popular search engines such as Yahoo, Excite, Lycos, and Altavista.
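A minimal sketch of the interactive loop, reusing the `expand_query` sketch above to generate candidates; the actual prototype is a Web interface sitting in front of the search engines listed above, not a command-line tool.

```python
def interactive_expand(keywords, cooc_dict, n_candidates=10):
    """Show the user the top co-occurrence candidates for the current
    keywords and let them pick which ones to add to the query."""
    candidates = expand_query(keywords, cooc_dict, n_candidates)
    for idx, word in enumerate(candidates):
        print(f"{idx}: {word}")
    chosen = input("Pick candidate numbers (comma separated): ")
    picked = [candidates[int(i)] for i in chosen.split(',') if i.strip()]
    return keywords + picked
```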
1. Calculate the ordered word co-occurrences from the Medlar, Broadcast News (140M tokens), and Wall Street Journal (1M tokens) corpora. The parameter in this experiment is the window size. We tried 10, 50, 100, 250, 500, 1000, and 1200, then evaluated the resulting co-occurrences; 1000 seems a reasonable balance between computation time and co-occurrence quality. There are 3 (corpora) * 7 (parameter values) = 21 experiments.
2. Test the new retrieval model -- the Context Model -- on Medlar. First we tested the average performance over all queries; the results were not encouraging. We then tested the performance on individual queries and compared it with VSM (ltc). The results showed an improvement on most of the short queries. There are 30 (queries) * 2 (approaches: VSM and the new method) = 60 experiments on individual queries.
3. The ordered word co-occurrences as a query expansion method, using our algorithm, on VSM, GVSM, and LSI. The number of words used to expand the query is adjustable; we tried 1 to 10, 15, and 20 words. The Medlar collection is used. For VSM, the ltc weighting scheme is used for documents and queries; for GVSM, the sparsification is 80; for LSI, the number of singular values is 300. There are 3 (models) * 12 (parameter values) = 36 experiments.
4. WordNet-based query expansion on VSM, GVSM, and LSI. All synonyms are used to expand the query regardless of part of speech. (We could also test expanding with only nouns, only verbs, or only adjectives/adverbs.) The Medlar and Unicef collections are used. For VSM, the ltc weighting scheme is used for documents and queries; for GVSM, the sparsification is 80; for LSI, the number of singular values is 300. There are 3 (models) * 2 (collections) = 6 experiments.
5. Relevance feedback on VSM, GVSM, and LSI. The parameter of these experiments is how many relevant documents are used as feedback; we tested 1 to 10 and 15 documents, randomly picked from the relevance judgements. The Medlar and Unicef collections are used. For VSM, the ltc weighting scheme is used for documents and queries; for GVSM, the sparsification is 80; for LSI, the number of singular values is 300. There are 3 (models) * 11 (parameter values) * 2 (collections) = 66 experiments.
6. Pseudo relevance feedback on VSM, GVSM, and LSI. The parameter is how many top-ranked documents are used as pseudo relevance feedback; we chose the top 1 to 10 and 15 documents. The Medlar and Unicef collections are used. For VSM, the ltc weighting scheme is used for documents and queries; for GVSM, the sparsification is 80; for LSI, the number of singular values is 300. There are 3 (models) * 11 (parameter values) * 2 (collections) = 66 experiments.
More than 250 experiments were designed and run in this project.
1. The ordered word co-occurrences for Medlar.
2. The ordered word co-occurrences for Wall Street Journal.
3. The ordered word co-occurrences for Broadcast News.
4. The average performance of the Context Model on Medlar.
5. The performance of the Context Model on individual queries on Medlar.
6. The ordered word co-occurrence query expansion on VSM for Medlar.
7. The ordered word co-occurrence query expansion on GVSM for Medlar.
8. The ordered word co-occurrence query expansion on LSI for Medlar.
9. WordNet-based query expansion on VSM for Medlar and Unicef.
10. WordNet-based query expansion on GVSM for Medlar and Unicef.
11. WordNet-based query expansion on LSI for Medlar and Unicef.
12. Relevance feedback on VSM for Medlar and Unicef.
13. Relevance feedback on GVSM for Medlar and Unicef.
14. Relevance feedback on LSI for Medlar and Unicef.
15. Pseudo relevance feedback on VSM for Medlar and Unicef.
16. Pseudo relevance feedback on GVSM for Medlar and Unicef.
17. Pseudo relevance feedback on LSI for Medlar and Unicef.