Question-answering, computer-assisted language learning, and other human language technology (HLT) applications that use a text search engine to dynamically find useful information in a large, possibly heterogeneous text corpus are becoming more common. This use of text search is very different from the ad-hoc, interactive search that information retrieval research typically studies. HLT applications make significant use of text annotations to reduce the mismatch between the concept-based representations convenient for reasoning and the word-based representations convenient for text retrieval. They have, and can describe, specific requirements that retrieved text passages must satisfy, for example, that a passage is at a particular level of reading difficulty, contains a particular type of named entity (e.g., a person name), or satisfies a particular Subject/Verb/Object or logical form pattern. Current search engine indexing techniques and retrieval models do not support this type of use well.
This project extends indexing and retrieval models developed initially for structured (e.g., XML) documents to provide convenient use of multiple representations of document content; support for multiple, detailed, overlapping text annotations; support for hierarchical text annotations such as parse trees; support for annotations with associated confidence values; best-match methods for text annotations; better support for using prior probabilities of relevance; and support for modeling term relationships. The new retrieval models are explicitly trainable, to support rapid and principled inclusion of new types of annotations and attributes.
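As a rough illustration of what best-match retrieval over annotations with confidence values might look like, the following Python sketch scores passages by combining query-term coverage with the confidence of matching annotations. It is not the project's retrieval model and does not use the Lemur Toolkit API; the names `Annotation`, `Passage`, `term_score`, `annotation_score`, and `score`, the linear mixing `weight`, and the sample passages are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical text annotation over a token span (spans may overlap)."""
    kind: str                 # e.g. "person" or "noun_phrase"
    start: int                # first token offset, inclusive
    end: int                  # last token offset, exclusive
    confidence: float = 1.0   # annotator confidence in [0, 1]


@dataclass
class Passage:
    tokens: list
    annotations: list = field(default_factory=list)


def term_score(passage, terms):
    """Fraction of query terms that occur in the passage (case-insensitive)."""
    if not terms:
        return 0.0
    present = {t.lower() for t in passage.tokens}
    return sum(1 for t in terms if t.lower() in present) / len(terms)


def annotation_score(passage, kind):
    """Highest confidence among annotations of the requested kind, else 0.0."""
    return max(
        (a.confidence for a in passage.annotations if a.kind == kind),
        default=0.0,
    )


def score(passage, terms, required_kinds, weight=0.5):
    """Best-match score: a linear mix of term coverage and the mean
    confidence with which each requested annotation kind is satisfied.
    A missing annotation lowers the score instead of excluding the passage."""
    if required_kinds:
        ann = sum(annotation_score(passage, k) for k in required_kinds)
        ann /= len(required_kinds)
    else:
        ann = 0.0
    return (1 - weight) * term_score(passage, terms) + weight * ann


# Two made-up passages; the first carries overlapping annotations.
p1 = Passage(
    "Marie Curie discovered radium".split(),
    [Annotation("person", 0, 2, 0.9), Annotation("noun_phrase", 0, 2, 1.0)],
)
p2 = Passage("Radium was discovered in 1898".split())

query_terms = ["discovered", "radium"]
ranked = sorted(
    [p1, p2],
    key=lambda p: score(p, query_terms, ["person"]),
    reverse=True,
)
```

Under these assumptions, both passages match the query terms equally, but the passage with a high-confidence person annotation ranks first. Treating the annotation requirement as a soft preference rather than a hard filter is what distinguishes a best-match method from exact-match filtering; a trainable model would learn the mixing weight rather than fix it by hand.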
By explicitly recognizing HLT applications as first-class users, our research broadens the research community's view of text search well beyond the simple queries, text representations, and retrieval models that typically characterize interactive, ad-hoc search today. It may eventually also lead to improved interactive search by providing a medium in which more powerful search interfaces can express constraints and preferences derived from sophisticated user and task models.
Our research results are disseminated through research publications and as part of the open-source Lemur Toolkit.
This research is sponsored in part by National Science Foundation grant IIS-0534345 and a gift from Yahoo! Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsors.