Search Engine Support for HLT Applications

Jamie Callan
Carnegie Mellon University

Project Overview

Question-answering, computer-assisted language learning, and other human language technology (HLT) applications that use a text search engine to dynamically find useful information in a large, possibly heterogeneous text corpus are becoming more common. This use of text search is very different from the ad-hoc, interactive search that information retrieval research typically studies. HLT applications make significant use of text annotations to reduce the mismatch between the concept-based representations convenient for reasoning and the word-based representations convenient for text retrieval. They have, and can describe, specific requirements that retrieved text passages must satisfy, for example, that a passage is at a particular level of reading difficulty, contains a particular type of named entity (e.g., a person name), or satisfies a particular Subject/Verb/Object or logical form pattern. Current search engine indexing techniques and retrieval models do not support this type of use well.

This project extends indexing and retrieval models developed initially for structured (e.g., XML) documents to provide convenient use of multiple representations of document content; support for multiple, detailed, overlapping text annotations; support for hierarchical text annotations such as parse trees; support for annotations with associated confidence values; best-match methods for text annotations; better support for using prior probabilities of relevance; and support for modeling term relationships. The new retrieval models are explicitly trainable, to support rapid and principled inclusion of new types of annotations and attributes.
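As a rough illustration of what "overlapping annotations with confidence values" and "best-match methods for text annotations" could look like, here is a minimal sketch in Python. It is not the project's retrieval model and does not use the Lemur Toolkit API; the names `Annotation`, `Passage`, `term_score`, `annotation_score`, `score`, and `rank`, and the mixing parameter `weight`, are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span over token positions [start, end) with a confidence in [0, 1]."""
    label: str
    start: int
    end: int
    confidence: float = 1.0


@dataclass
class Passage:
    """A tokenized passage plus any number of (possibly overlapping) annotations."""
    tokens: list
    annotations: list = field(default_factory=list)


def term_score(passage, query_terms):
    """Fraction of query terms that occur in the passage (a crude word-based match)."""
    if not query_terms:
        return 0.0
    present = {t.lower() for t in passage.tokens}
    return sum(1 for t in query_terms if t.lower() in present) / len(query_terms)


def annotation_score(passage, label):
    """Best-match evidence for one annotation type: the highest confidence among
    annotations with that label, or 0.0 when the passage has none."""
    return max(
        (a.confidence for a in passage.annotations if a.label == label),
        default=0.0,
    )


def score(passage, query_terms, required_labels, weight=0.5):
    """Linear mix of term evidence and mean annotation evidence.

    A missing annotation lowers the score instead of excluding the passage,
    which is what makes this a best-match rather than an exact-match filter.
    The linear form and the default weight are assumptions for illustration.
    """
    if required_labels:
        ann = sum(annotation_score(passage, lab) for lab in required_labels)
        ann /= len(required_labels)
    else:
        ann = 0.0
    return (1 - weight) * term_score(passage, query_terms) + weight * ann


def rank(passages, query_terms, required_labels, weight=0.5):
    """Return passages ordered from highest to lowest score."""
    return sorted(
        passages,
        key=lambda p: score(p, query_terms, required_labels, weight),
        reverse=True,
    )
```

For instance, a passage whose first two tokens carry both a PERSON annotation (confidence 0.9) and an overlapping NP annotation would outrank a passage that matches the same query terms but has no PERSON annotation, when the query asks for a person name. In a trainable model, a parameter such as `weight` would be learned from data rather than fixed by hand.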

By explicitly recognizing HLT applications as first-class users, our research broadens the research community's view of text search well beyond the simple queries, text representations, and retrieval models that typically characterize interactive, ad-hoc search today. It may eventually also lead to improved interactive search by providing a medium in which more powerful search interfaces can express constraints and preferences derived from sophisticated user and task models.

Project Personnel

Jamie Callan, Principal Investigator
Kevyn Collins-Thompson, Ph.D. 2008
Paul Ogilvie, Graduate Research Assistant
Le Zhao, Graduate Research Assistant

Dissemination of Research Results

Our research results are disseminated through research publications and as part of the open-source Lemur Toolkit.

Collaborating Projects

GALE, Javelin (completed), and REAP

This research is sponsored in part by National Science Foundation grant IIS-0534345 and a gift from Yahoo! Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsors.

Updated on March 30, 2008.
