SMART
Synopsis
SMART software
Description
The SMART package does not come with full documentation, and since most of its features will not be used in the course, we comment only on the most relevant ones.
Tutorial
Installing Smart
Usage Information
Example
Weighting schemes and nomenclature
Tutorial
If you are just beginning, this tutorial and some other information, including a Smart overview (written & maintained by Hans Paijmans), might be helpful.
Installing Smart
1. Smart has been compiled for Sun and Linux. If you are using a Sun workstation, include "/afs/cs.cmu.edu/academic/class/11741-s98/bin/" in your PATH environment variable.
If you are using Linux, use "/afs/cs.cmu.edu/academic/class/11741-s98/linux/bin" instead.
If you already have an account on oslo, we recommend (to reduce network traffic) that you use "/usr9/ir/tomp/ir11741/bin" and "/usr9/zechner/ir-course/smart/scripts".
2. Create a smart subdirectory in your space.
3. Copy the following three files from the above-mentioned start-up directory to this directory: Makefile, Word.run, Indexing. The Makefile is just a way of keeping a record of all the procedures/dependencies needed in your exercise; the others are more like wrappers. (A command sketch of these steps follows this list.)
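A minimal command sketch of the three steps above, assuming a csh-style shell (under bash, use export PATH=$PATH:... instead) and assuming the start-up files live in the scripts directory listed above; adjust the paths to whichever copy you are actually using:
setenv PATH ${PATH}:/afs/cs.cmu.edu/academic/class/11741-s98/bin
mkdir ~/smart
cd ~/smart
cp /usr9/zechner/ir-course/smart/scripts/Makefile .
cp /usr9/zechner/ir-course/smart/scripts/Word.run .
cp /usr9/zechner/ir-course/smart/scripts/Indexing .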
Usage Information
You can run Smart most conveniently using the Makefile; look at it for more details.
When you inspect the Makefile, you will find several options that you can change easily, for example the links to the corpus and query files (docs.smart, queries.smart).
The Makefile also generates a "spec" file that you can modify directly (look at it for some documentation) to make more elaborate changes, such as:
describing a new format/new database
indexing only some sections (e.g. titles) of it
To run Smart using the Makefile, type: gmake fullset
This will perform indexing, querying and evaluation of the documents in docs.smart using the queries in queries.smart.
To get a better feel for what happens in Smart, you can use a smaller query file, compare the top-N documents returned by different queries, and/or look at the intermediate files (e.g. the document vectors) which Smart creates. For the latter, you have to disable the "tidy" option/parameter in the Makefile and make Smart print out its files as readable text (see below).
If you look at Word.run, you will see that it can be called with parameters that determine the use of stoplists and stemming. For this part of the homework, you have to modify the Makefile accordingly where Word.run is invoked to get the desired results.
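Purely as an illustration, since the actual parameter names are defined inside Word.run (check the script before editing), the invocation line in the Makefile might change from something like
Word.run nostop nostem
to
Word.run stop stem
to switch both the stoplist and stemming on; the nostop/nostem/stop/stem arguments here are hypothetical placeholders, not the script's real parameter names.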
Printing (Converting) Smart Files
Generally, Smart files are binary and hence unreadable. But you can print/convert them to see what is really going on inside of Smart.
Use the command: smprint <type> <file>
(this will print to stdout)
where <type> is either dict (to print the dictionary file dict), vec (to print document or query vectors, e.g. doc.nnn; a .var file is needed in this case), or tr_vec (to print retrieval results, e.g. tr.nnn.nnn).
Note that most of these tables show the word-IDs and not the words. You have to look at the (converted) dict file to see which words have which IDs.
Some examples:
smprint vec doc.ltc
will print a table with
[doc_id zero term_id value]
smprint tr_vec tr.ltc.ltc
will print a table with
[query doc_id rank zero zero zero similarity_value]
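One handy combination is to look up a word's ID in the converted dictionary and then pull that term out of a document vector. This is only a sketch: "retrieval" and the term ID 1234 are placeholders, and the exact layout of the converted dict file may differ, so inspect it first.
smprint dict dict | grep -i retrieval
smprint vec doc.ltc | awk '$3 == 1234'
The awk filter relies on the [doc_id zero term_id value] column order shown above and keeps the rows whose third column matches the term ID you found.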
Example
Let's use SMART to check how stemming and stopwords affect the retrieval performance on the MEDLARS corpus.
gmake fullset
The Makefile provided will do most of the job.
Afterwards you should have a number of files in your directory.
Check "word_1_1.ltc.ltc.n1121.eval" for the evaluation results.
Example
For the UNICEF corpus (1121 docs, 29 queries), using ltc weighting:
Stemming | stop-words | 11-pt prec
No       | No         | 0.4076
No       | Yes        | 0.4094
Yes      | No         | 0.4288
Yes      | Yes        | 0.4305
Do the same kind of tests for the MEDLARS corpus!
Weighting schemes & nomenclature
This is where the 3-character identification used for the document and query weighting schemes comes from:
term_freq:
b: binary (always 1)
a: term_freq normalized between 0.5 and 1.0 (i.e., 0.5 + 0.5*tf/max_tf_in_doc)
l: 1 + ln(term_freq)
n: term_freq (i.e., number of times term occurs in doc)
idf:
t: ln(N/n) where N=no. docs in collection and n=no. docs in which term occurs.
n: no idf factor
normalization:
c: cosine normalization.
n: no normalization.
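As a worked example of what the first two letters contribute (the numbers are made up for illustration): under ltc, a term that occurs 3 times in a document and appears in 100 of the 1121 documents in the collection gets the raw weight (1 + ln 3) * ln(1121/100) ≈ 2.10 * 2.42 ≈ 5.07; the whole document vector is then cosine-normalized (the final c) so that long documents do not dominate the similarity scores.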
Links
If for some reason you feel like using bandwidth instead of the copies at LTI, the original files are at ftp://ftp.cs.cornell.edu/pub/smart/