SMART
Synopsis
SMART software
Description
The SMART package does not come with full documentation, and since most of its features will not be used in the course, we comment only on the most relevant ones.
Tutorial
Installing Smart
Usage Information
Example
Weighting schemes and nomenclature
Tutorial
If you are just beginning, this tutorial and some other information, including a Smart overview (written & maintained by Hans Paijmans), might be helpful.
Installing Smart
1. Smart has been compiled for Sun and Linux. If you are using a Sun workstation, include "/afs/cs.cmu.edu/academic/class/11741-s98/bin/" in your PATH environment variable.
If you are using Linux, use "/afs/cs.cmu.edu/academic/class/11741-s98/linux/bin" instead.
If you already have an account on oslo, we recommend (to reduce network traffic) that you use "/usr9/ir/tomp/ir11741/bin" and "/usr9/zechner/ir-course/smart/scripts".
2. Create a smart subdirectory in your space.
3. Copy the following three files from the above-mentioned start-up directory to this directory: Makefile, Word.run, Indexing. The Makefile is just a way of keeping a record of all the procedures/dependencies needed in your exercise; the others are more like wrappers. (A command sketch of these steps follows this list.)
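A minimal command sketch of the three steps above, assuming a csh-style shell (under bash, use export PATH=$PATH:... instead) and assuming the start-up files live in the scripts directory listed above; adjust the paths to whichever copy you are actually using:
setenv PATH ${PATH}:/afs/cs.cmu.edu/academic/class/11741-s98/bin
mkdir ~/smart
cd ~/smart
cp /usr9/zechner/ir-course/smart/scripts/Makefile .
cp /usr9/zechner/ir-course/smart/scripts/Word.run .
cp /usr9/zechner/ir-course/smart/scripts/Indexing .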
Usage Information
You can run Smart most conveniently using the Makefile; look at it for more details.
When you inspect the Makefile, you will find several options that you can change easily, for example the links to the corpus and query files (docs.smart, queries.smart).
The Makefile also generates a "spec" file that you can modify directly (look at it for some documentation) to make more elaborate changes, such as:
describing a new format/new database
indexing only some sections (e.g. titles) of it
To run Smart using the Makefile, type: gmake fullset
This will perform indexing, querying and evaluation of the documents in docs.smart using the queries in queries.smart.
To get a better feel for what happens in Smart, you can use a smaller query file, compare the top-N documents returned by different queries, and/or look at the intermediate files (e.g. the document vectors) which Smart creates. For the latter, you have to disable the "tidy" option/parameter in the Makefile and make Smart print out its files as readable text (see below).
If you look at Word.run, you will see that it can be called with parameters that determine the use of stoplists and stemming. For this part of the homework, you have to modify the Makefile accordingly where Word.run is invoked to get the desired results.
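Purely as an illustration, since the actual parameter names are defined inside Word.run (check the script before editing), the invocation line in the Makefile might change from something like
Word.run nostop nostem
to
Word.run stop stem
to switch both the stoplist and stemming on; the nostop/nostem/stop/stem arguments here are hypothetical placeholders, not the script's real parameter names.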
Printing (Converting) Smart Files
Generally, Smart files are binary and hence unreadable. But you can print/convert them to see what is really going on inside of Smart.
Use the command: smprint <type> <file>
(this will print to stdout)
where <type> is either dict (to print the dictionary file dict), vec (to print document or query vectors, e.g. doc.nnn; a .var file is needed in this case), or tr_vec (to print retrieval results, e.g. tr.nnn.nnn).
Note that most of these tables show the word-IDs and not the words. You have to look at the (converted) dict file to see which words have which IDs.
Some examples:
smprint vec doc.ltc
will print a table with
[doc_id zero term_id value]
smprint tr_vec tr.ltc.ltc
will print a table with
[query doc_id rank zero zero zero similarity_value]
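One handy combination is to look up a word's ID in the converted dictionary and then pull that term out of a document vector. This is only a sketch: "retrieval" and the term ID 1234 are placeholders, and the exact layout of the converted dict file may differ, so inspect it first.
smprint dict dict | grep -i retrieval
smprint vec doc.ltc | awk '$3 == 1234'
The awk filter relies on the [doc_id zero term_id value] column order shown above and keeps the rows whose third column matches the term ID you found.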
Example
Let's use SMART to check how stemming and stopwords affect the retrieval performance on the MEDLARS corpus.
gmake fullset
The Makefile provided will do most of the job.
Afterwards you should have a number of files in your directory.
Check "word_1_1.ltc.ltc.n1121.eval" for the evaluation results.
Example
For the UNICEF corpus (1121 docs, 29 queries), using ltc weighting:
Stemming | stop-words | 11-pt prec
No       | No         | 0.4076
No       | Yes        | 0.4094
Yes      | No         | 0.4288
Yes      | Yes        | 0.4305
Do the same kind of tests for the MEDLARS corpus!
Weighting schemes & nomenclature
This is where the 3-character identification used for the document and query weighting schemes comes from:
term_freq:
b: binary (always 1)
a: term_freq normalized between 0.5 and 1.0 (i.e., 0.5 + 0.5*tf/max_tf_in_doc)
l: 1 + ln(term_freq)
n: term_freq (i.e., number of times term occurs in doc)
idf:
t: ln(N/n) where N=no. docs in collection and n=no. docs in which term occurs.
n: no idf factor
normalization:
c: cosine normalization.
n: no normalization.
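As a worked example of what the first two letters contribute (the numbers are made up for illustration): under ltc, a term that occurs 3 times in a document and appears in 100 of the 1121 documents in the collection gets the raw weight (1 + ln 3) * ln(1121/100) ≈ 2.10 * 2.42 ≈ 5.07; the whole document vector is then cosine-normalized (the final c) so that long documents do not dominate the similarity scores.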
Links
If for some reason you feel like using bandwidth instead of the copies at LTI, the original files are at ftp://ftp.cs.cornell.edu/pub/smart/