This page provides an overview of Jade Goldstein's term project for IR 11-741 Spring 1998. There are many interesting term projects in this IR course.
GOAL:
The purpose of this project is to evaluate query-relevant summaries of short documents (less than 300 lines) produced using different weighting methods in SMART (nnc, ntc, ltc, atc) and through the use of different query expansion algorithms, namely a version of PRF (smart.prf) and query expansion using WordNet (smart.qexp). For further information, refer to the project proposal.
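For readers unfamiliar with SMART's three-letter weighting codes, here is a minimal Python sketch (not SMART's actual code; weight_vector and tf_component are illustrative helper names) of how the four schemes combine a term-frequency component, an idf component, and cosine normalization:

    import math

    def tf_component(code, tf, max_tf):
        if code == "n":                      # natural: raw term frequency
            return float(tf)
        if code == "l":                      # logarithmic: 1 + log(tf)
            return 1.0 + math.log(tf)
        if code == "a":                      # augmented: 0.5 + 0.5*tf/max_tf
            return 0.5 + 0.5 * tf / max_tf
        raise ValueError("unknown tf code: " + code)

    def weight_vector(scheme, tfs, idfs):
        # tfs: {term: tf}; idfs: {term: idf}; scheme: e.g. "ltc".
        max_tf = max(tfs.values())
        w = {t: tf_component(scheme[0], f, max_tf) *
                (idfs[t] if scheme[1] == "t" else 1.0)
             for t, f in tfs.items()}
        if scheme[2] == "c":                 # cosine-normalize the vector
            norm = math.sqrt(sum(v * v for v in w.values()))
            w = {t: v / norm for t, v in w.items()}
        return w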
An analysis of the relevance of this project was written for the midterm.
Task | By | Status
Implement Code for Evaluation of Summarization Systems | Mar 9 | done |
Compare Weighting Methods in Smart | Mar 16 | done |
Code smart.prf | Apr 6 | done |
Code smart.qexp | Apr 15 | done |
Running Final Results | current | in progress |
A very preliminary results report was written on March 17th. In this evaluation, we used the standard SMART-defined 11-pt average precision curves, but with a sentence rather than a document as the unit of measure. We ran the comparison over all documents that had relevant sentences (40 out of the 50); the non-interpolated results are shown in the following files: different SMART weightings, different thresholds, and MMR reranking. We also computed the non-interpolated measures using only the relevant documents (15 out of 50): different SMART weightings, different thresholds, and MMR reranking.
From the results, we can see a definite increase in the scores when only the relevant documents are used. However, this methodology with the non-interpolated results is not suitable due to the small number of relevant sentences in these "short" TREC documents that we are evaluating. The reason can be seen in the case of a document with 3 relevant sentences, all of which are retrieved. Here the first correct sentence falls in interval (bin) 3 (recall 1/3), the second in bin 6 (2/3), and the third in bin 10 (1/1). Thus, we will have lower recall/precision scores for the first interval (0), as seen in the graphs, and interval 10 will be inflated compared to bins 8 and 9 (which have less chance of having sentences fall into their categories). Since the median number of relevant sentences is 5 for these 40 documents, and 7 or 8 for the 15 relevant documents, the majority of the documents suffer from not having enough relevant sentences to cover all the bins.
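The sparsity can be illustrated with a small Python sketch (illustrative only; occupied_bins is a hypothetical helper, not part of our evaluation code). The i-th relevant sentence retrieved reaches recall i/R, which lands in bin floor(10*i/R) of the 11 intervals 0.0, 0.1, ..., 1.0, so a document with few relevant sentences leaves most bins empty:

    import math

    def occupied_bins(num_relevant):
        # Recall after the i-th relevant sentence is i/num_relevant;
        # it falls into 11-pt interval floor(10 * i / num_relevant).
        return sorted({math.floor(10 * i / num_relevant)
                       for i in range(1, num_relevant + 1)})

    print(occupied_bins(3))   # [3, 6, 10] -- eight of the 11 bins stay empty
    print(occupied_bins(5))   # [2, 4, 6, 8, 10] -- the median case, 6 empty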
The first proposal was simply to divide each bin by the number of elements contributing to that particular bin. We also tried using 3-pt non-interpolated average precision curves. The results are shown in the following files: SMART weightings 11-pt, SMART weightings relevant documents only 11-pt, SMART weightings 3-pt, and SMART weightings relevant documents only 3-pt. As these graphs show, in the 3-pt case over relevant documents only, where the median number of relevant sentences is high, 3 bins were sufficient to form a curve without such dramatic dips; for the 11-pt curve over relevant documents, we still obtained the dips.
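A minimal sketch of this first proposal, assuming per-bin averaging across documents (averaged_curve is a hypothetical name):

    def averaged_curve(per_doc_bins, num_bins=11):
        # per_doc_bins: one dict {bin_index: precision} per document.
        # Each bin is divided by the number of documents that actually
        # contributed a value to it, not by the total number of documents.
        totals = [0.0] * num_bins
        counts = [0] * num_bins
        for bins in per_doc_bins:
            for b, p in bins.items():
                totals[b] += p
                counts[b] += 1
        return [t / c if c else 0.0 for t, c in zip(totals, counts)]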
The problem with dividing by only the number of elements contributing to the bin is that it artificially raises scores, especially at the tail end of the curve, where one would expect the curve to decrease significantly if full recall is not met. Since using the 3-pt average precision curves would eliminate important data, we propose a new measure. For the cases where the number of relevant sentences cannot fill the bins, i.e., where the number of relevant sentences is less than the number of intervals (11), we will use a step function to fill in the missing bins. We believe that the step function accurately describes what is happening in these cases, and thus this modified measure is appropriate. Note that the interpolated measure automatically solves the problem of missing bins in this same manner; however, it can also inflate the curve in cases of lower precision at low recall intervals, and we are interested in tightly evaluating the performance of our different weighting methods and systems.
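A minimal sketch of the step-function fill (step_fill is an illustrative name; the assumed direction is that each empty bin takes the precision of the first occupied bin at higher recall, mirroring how the interpolated measure fills gaps but without taking the maximum):

    def step_fill(sparse_bins, num_bins=11):
        # sparse_bins: {bin_index: precision} for one document.
        # Assumption: each observed precision is copied back into the
        # empty bins preceding it, giving a per-document step function.
        filled = [0.0] * num_bins
        nxt = 0.0
        for b in range(num_bins - 1, -1, -1):
            if b in sparse_bins:
                nxt = sparse_bins[b]
            filled[b] = nxt
        return filled

    # 3 relevant sentences at ranks 1, 3, 5: precision 1/1, 2/3, 3/5
    print(step_fill({3: 1.0, 6: 2 / 3, 10: 0.6}))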
Coming all too soon: to be submitted on Thursday, April 23rd.
Project presentation is Thursday, April 30th, at 1:30 in the Blue Conference Room.
Demo on Tuesday, April 28th.
This page is maintained by Jade Goldstein (jade@cs.cmu.edu)
Last updated Tuesday, April 14th.