The purpose of this project is to measure the effectiveness and usability of SMART for retrieving desired information from a text database, compared with a conventional keyword search. The test collection is the CS zephyr archive, which is available to members of the CS department at Carnegie Mellon. The goal is not to evaluate SMART itself, nor to compare the two systems on identical queries, but to measure the same users' ability to retrieve the information they want through each interface, using search techniques more advanced than keyword matching.
Zephyr is a real-time communication system, similar to IRC, which allows members of the CS department to hold conversations on a variety of topics, organized under topic labels referred to as "instances." Typical uses of zephyr include asking questions about computer-related issues (such as help using an application), discussing recent newsworthy events, and general conversation.
Since August 1993, most publicly broadcast zephyrgrams have been logged for archival purposes. A simple keyword-based query mechanism is available to search the archives. For instance, rather than asking on zephyr how to build shared libraries for a DEC Alpha, a user could submit the query "build shared library osf", although this does not produce the best results ("shared library *osf*" does). The uses to which the current interface is put make it ideal for exploring more advanced search techniques.
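As a rough illustration of why the wildcard form works better, the following Python sketch contrasts whole-word keyword matching with glob-style matching. The zephyrgrams are invented, and the real archive tool's query syntax is only approximated here.

\begin{verbatim}
import re

# Invented zephyrgrams; the real archive and its search tool differ.
zephyrgrams = [
    "how do I build a shared library under osf1?",
    "you need ld -shared under osf1, see the man page",
    "shared libraries on linux use gcc -shared",
]

def keyword_match(text, terms):
    """Boolean AND of whole-word matches, roughly what the keyword search does."""
    words = set(re.findall(r"\w+", text.lower()))
    return all(t.lower() in words for t in terms)

def wildcard_match(text, patterns):
    """Boolean AND of glob-style patterns; '*osf*' becomes a substring test."""
    lowered = text.lower()
    return all(re.search(p.lower().replace("*", ".*"), lowered) for p in patterns)

print([z for z in zephyrgrams if keyword_match(z, ["shared", "library", "osf"])])
# [] -- "osf" only ever occurs inside "osf1", never as a standalone word
print([z for z in zephyrgrams if wildcard_match(z, ["shared", "library", "*osf*"])])
# matches the question about building a shared library under osf1
\end{verbatim}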
Task | To be done by | Status
Human Subjects Clearance Request | ASAP | Completed
Web Page Initial Design | March 3 | Completed
Obtain SMART for DEC Alpha platform | March 6 | Completed
Software Setup with Zephyr Archive | March 10 | Completed
Preliminary Interface for Queries | March 16 | Completed
Preliminary Results and Report | March 17 | Completed
Complete Web Interface with Relevance Judgements and Feedback | March 19 | Completed
Begin Recording Usage | March 31 | Completed
Data Collection Completed | April 21 | Surveys completed
Complete Evaluation, Ensure Enough Relevance Info | April 22 | Nine surveys completed and tabulated
Report Due | April 23 | Completed
Presentation | April 28 | Completed
The zephyr archives have been converted to a format usable by SMART and are accessible through the HTML interface used for the experiments. This interface allows users to mark retrieved zephyrgrams as relevant, both to drive relevance feedback and to record which zephyrgrams satisfied a query. The SMART zephyr query interface remains accessible.
The experimental platform is SMART 11.0 running on a Digital Personal Workstation 500au, which has a 500 MHz Alpha processor and 384 MB of RAM, using approximately 1.7 GB of disk space for the project. The original zephyr archives are approximately 190 MB of data containing over 700,000 zephyrgrams. Building the index and calculating weights consumes the majority of the disk space and takes approximately 15 minutes; however, searches over the resulting index complete in less than a second, even for queries of 20 or more words. This is an improvement over the existing keyword search engine, which is fast for small queries but takes several times longer for longer queries (the exact cost function is unknown to me).
Adjusting the system to achieve reasonable precision proved problematic. The keyword interface supports searches on the body, sender, and instance (topic label) of the zephyrgrams. SMART was originally configured the same way, but this proved problematic because relevance feedback gave similarity on the sender and instance fields equal weight with similarity on the zephyrgram body. This behavior did not match users' expectations when searching for other zephyrs on a similar topic, so indexing of the sender and instance fields was removed for this project, until the relevance feedback can be improved in this regard.

A second problem was that a standard query weight such as "ltc" gave short documents preference over longer ones. Frequently a zephyrgram whose body was a single word that happened to appear in the query would be returned as the most relevant document. Using a query weight of "mtn" was empirically determined to be the best choice. This scheme overemphasizes long zephyrgrams, but experience indicates that this is preferable to overweighting short zephyrgrams, which are frequently content-free. Log-based length normalization would be a useful feature to add to compensate for this behavior.
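The following toy sketch (Python, not SMART's actual code) illustrates the length-normalization effect described above. It assumes one common reading of the weighting codes, namely "l" as 1 + log tf, "t" as idf, "c" as cosine length normalization, "m" as an augmented max-tf factor, and "n" as no normalization; SMART's own definitions may differ.

\begin{verbatim}
import math, re
from collections import Counter

# Invented collection: a one-word zephyrgram, a longer genuinely useful
# answer, and filler so the idf values are not degenerate.
docs = [
    "scsi",
    "for a desktop the cheaper ide drive is fine but a busy server "
    "wants a scsi drive for the extra bandwidth and lower cpu load",
    "anyone for lunch at the usual place",
    "my kernel build failed again with the new gcc",
    "the printer on the fifth floor is jammed",
]
query = "ide scsi".split()

def tokens(text):
    return re.findall(r"\w+", text.lower())

N = len(docs)
tfs = [Counter(tokens(d)) for d in docs]
df = Counter(t for tf in tfs for t in tf)
idf = {t: math.log(N / df[t]) for t in df}

def ltc(tf):
    # "ltc" as assumed here: (1 + log tf) * idf, cosine length normalization.
    w = {t: (1 + math.log(c)) * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()}

def mtn(tf):
    # "mtn" as assumed here: augmented max-tf * idf, no length normalization.
    m = max(tf.values())
    return {t: (0.5 + 0.5 * c / m) * idf[t] for t, c in tf.items()}

qvec = {t: idf[t] for t in query}  # simple idf-weighted query vector

def score(dvec):
    return sum(qvec.get(t, 0.0) * w for t, w in dvec.items())

for name, weight in (("ltc", ltc), ("mtn", mtn)):
    best = max(range(N), key=lambda i: score(weight(tfs[i])))
    print(name, "top hit:", docs[best][:40])
# With cosine normalization the one-word "scsi" zephyrgram ranks first;
# without length normalization the longer, useful answer wins.
\end{verbatim}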
The original experimental design called for a preliminary phase of analyzing the results of informal queries brought by users to the system. However, a lack of participants volunteering their time prevented this from being effective. Instead, participants were allowed to familiarize themselves with the system as they wished, and no results were collected from this phase. Formal evaluation was done by giving each participant eight specific topics to search for in the database. For each topic, the participant first designed a query for the familiar keyword search system. This query was recorded but not performed, so the participant was not biased by seeing the results of the basic query \cite{dumas}. The participant then used the interactive SMART system, with complete freedom to reformulate the initial query and to provide relevance feedback to expand it. The user's actions were fully logged through the interface for later study. Finally, the user indicated whether they believed they had successfully found the relevant information using the SMART interface.
All subjects who participated in this study were experienced with boolean keyword searches on the zephyr archives. Only one subject had any experience with more advanced IR techniques. The queries they generated ranged from simple reuse of their boolean queries to natural-language queries. An informal look at the query styles did not indicate that the different methods of forming queries had any effect on the results, and the one participant with IR experience performed no better or worse than the others.
The exact topics used are:
These topics were chosen to represent the variety of subjects typically discussed on zephyr and queried for using the interface. General topics are fairly predictable, but the prevalence of computer topics in this environment was interesting to study, in particular because fixed, well-known keywords ("linux," "unix," "find," "gcc") might make keyword searching more effective than it is on typical collections. The first four questions are designed to require specific answers, all of which appear several times in the zephyr archives. The second four questions ask for zephyrs on a particular topic rather than for a specific piece of information. This combination was chosen to represent a typical variety of uses of the search interface. The instructions and the actual list of topics, as given to the participants, are available from my project's web page.
Evaluating the results is difficult given the small number of participants and large volume of data. I originally intended to use precision-recall data on some of the four general queries, but this is impractical given that approximately 1000 documents were returned as results of those queries. Instead, I will discuss the general character of the results.
SMART performed excellently on all but two topics. A closer analysis indicates that the poor performance on the dynamic libraries and renting vs.~buying questions stems from the issue of user perceptions discussed earlier. In fact, for all 18 initial queries that participants used on these two topics, a relevant zephyrgram from a conversation on the exact desired topic appeared in the results. Asking to see the "context" of that zephyrgram, i.e., the conversation in which it appeared, would have given the user the information needed to declare the search successful. Indeed, when making my judgements on the keyword searches, I considered such a result successful. However, in most cases the participants did not ask for the context of the zephyrgrams and did not believe the search to be successful. Informal conversation with participants revealed two problems.
The first was that several of the users considered only the results returned directly by SMART, without selecting the "context" links, and therefore had more difficulty than other users in finding the desired information. Context links turned out to be extremely useful because a search would frequently match a question in the archive quite well but not the actual answer given in a following zephyrgram; the context link exposes that answer. A better solution would use topic detection and tracking to combine each zephyr "conversation" into a single document. This would also mitigate the query-weighting problems caused by short documents, since most short zephyrgrams would be merged into their larger conversations.
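As a sketch of what such merging might look like, the following Python fragment groups zephyrgrams into conversation documents using only information the archive already records (instance label and timestamp). The data layout and the 30-minute gap threshold are assumptions of mine; real topic detection and tracking would be considerably more sophisticated.

\begin{verbatim}
from dataclasses import dataclass

@dataclass
class Zephyrgram:
    instance: str     # topic label, e.g. "linux"
    timestamp: float  # seconds since some epoch
    sender: str
    body: str

def group_conversations(zgrams, max_gap=1800.0):
    """Merge consecutive zephyrgrams on the same instance into one document.

    A new conversation starts when the gap since the previous zephyrgram on
    that instance exceeds max_gap (30 minutes here, an arbitrary threshold).
    """
    last_seen = {}       # instance -> (conversation index, last timestamp)
    conversations = []   # each entry is a list of zephyrgrams
    for z in sorted(zgrams, key=lambda z: z.timestamp):
        prev = last_seen.get(z.instance)
        if prev is not None and z.timestamp - prev[1] <= max_gap:
            idx = prev[0]
            conversations[idx].append(z)
        else:
            idx = len(conversations)
            conversations.append([z])
        last_seen[z.instance] = (idx, z.timestamp)
    # Concatenate bodies so each conversation can be indexed as one document.
    return [" ".join(z.body for z in conv) for conv in conversations]

convs = group_conversations([
    Zephyrgram("linux", 0.0, "alice", "how do I build a shared library?"),
    Zephyrgram("linux", 120.0, "bob", "pass -shared to the linker"),
    Zephyrgram("help", 60.0, "carol", "is the printer on 5 working?"),
])
print(convs)  # two documents: the linux exchange and the printer question
\end{verbatim}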
The second major difficulty in detecting relevant documents was the length of the returned result lists. In each case, a zephyrgram which was fairly obviously part of an appropriate conversation was returned, but it was not necessarily near the top of the list. It is unknown how many of the participants read all 50 returned zephyrgrams carefully enough to recognize its appropriateness. Several participants did find and select the relevant zephyrgrams and performed relevance feedback on them, but this did not help. One important difference between keyword and VSM searching should be noted: boolean keyword queries can usually be designed to match very specific sets of terms and therefore return a smaller set of results. This is good when the desired information is in one of the documents actually returned, but the most common failure mode observed was a keyword query returning zero documents because no zephyrgram matched all of the required terms.
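To make the contrast concrete, here is a minimal sketch of the two behaviors on invented zephyrgrams, with query-term overlap standing in for the actual vector-space score:

\begin{verbatim}
import re

# Invented zephyrgrams; real archive content differs.
archive = [
    "is it smarter to rent or buy a house around here?",
    "renting made sense for me since I'm only here for two more years",
    "the housing market in squirrel hill is pretty tight right now",
]
terms = ["rent", "buy", "house", "pittsburgh"]

def words(text):
    return set(re.findall(r"\w+", text.lower()))

# Boolean AND: only zephyrgrams containing every term qualify.
print([z for z in archive if all(t in words(z) for t in terms)])
# [] -- no single zephyrgram contains all four query terms as whole words

# Ranked retrieval (crudely approximated by query-term overlap) still
# returns partial matches, best first.
print(sorted(archive, key=lambda z: -len(words(z) & set(terms)))[0])
\end{verbatim}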
To eliminate the bias introduced because participants could not refine their original keyword queries, the results were also tabulated according to whether success was achieved on the first SMART query submitted. Again, the results are quite good for all questions except the two discussed above.
Relevance feedback was a disappointment, but analysis reveals that its failure points to another problem with short zephyrgrams. In the cases where relevance feedback failed, users had typically selected zephyrgrams which appeared to be relevant and used them for feedback. However, these zephyrgrams often contained many additional words unrelated to the topic. A zephyrgram which appeared relevant to the IDE vs.~SCSI question, for instance, might contain enough other non-stopwords pointing toward a different topic. Performing feedback on a zephyrgram discussing how the linux kernel affects the performance of IDE and SCSI disk drives, for instance, might cause discussions of compiling the linux kernel to be retrieved. In general, many short zephyrgrams do not contain a sufficient concentration of terms specific to the desired topic to provide effective feedback. Expanding on zephyrgrams containing news articles proved quite effective, due to their repetitive nature, but typical zephyrgrams were not as useful. Again, topic detection and tracking would help by concentrating the relevant terms in one document.
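The drift can be seen in a small Rocchio-style feedback sketch; the alpha and beta values, the use of raw term counts in place of SMART's weights, and the example zephyrgram are all assumptions of mine:

\begin{verbatim}
from collections import Counter

def rocchio(query_vec, relevant_vecs, alpha=1.0, beta=0.75):
    # Rocchio-style expansion: q' = alpha*q + beta*mean(relevant docs).
    # Plain term counts stand in for SMART's tf-idf weights.
    expanded = Counter({t: alpha * w for t, w in query_vec.items()})
    for doc in relevant_vecs:
        for t, w in doc.items():
            expanded[t] += beta * w / len(relevant_vecs)
    return expanded

query = Counter("ide scsi performance".split())

# A "relevant" zephyrgram that mentions IDE and SCSI but spends most of its
# words on the linux kernel (an invented example).
feedback = Counter(
    "recompiling the linux kernel with the new ide driver made my scsi "
    "and ide disks faster under linux".split()
)

expanded = rocchio(query, [feedback])
print({t: round(w, 2) for t, w in expanded.items()
       if t in {"ide", "scsi", "performance", "linux", "kernel"}})
# The expanded query now gives "linux" and "kernel" noticeable weight, so a
# follow-up search can drift toward kernel-compilation zephyrgrams.
\end{verbatim}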
SMART provided better results than keyword searches in this brief study. The experience suggests that logarithmic length normalization would be a useful feature for SMART. The short documents in the zephyr archives limited the effectiveness of relevance feedback and made it difficult to determine whether a returned zephyrgram was appropriate. The use of topic detection and tracking (TDT) to combine zephyrgrams into conversations would be very helpful, and should greatly improve precision by concentrating an entire discussion of a particular topic in one document and improving the match for the most important terms. This should be relatively simple to add to the zephyr archive system, as instances provide helpful guides for identifying similar and different topics of conversation, even when two are occurring simultaneously. TDT would also eliminate the need for a user to select the context of a relevant zephyrgram by displaying the appropriate information automatically.
Users indicated that even though the returned results included appropriate zephyrs, searching through the large number of results could be difficult. In fact, many perceived the SMART system as more difficult to use than the standard keyword system, even though their keyword searches returned no results when actually tested. User perceptions are an important and complex part of the usability question, as demonstrated by the two search topics on which most participants did not realize they had successfully retrieved the desired information.
For a demo, I'll probably just demonstrate the use of the two interfaces and how the relevance judgements were determined.