Discourse Structure from Topic Models in Text and Video

Abstract

This talk describes how latent topic models can be used to discover discourse structure in unannotated text and video. Linguists have long believed that discourse topic shifts are marked by changes in the distribution of lexical items. This idea is called lexical cohesion, and can be formalized in a latent topic model, yielding substantial performance gains over previous heuristic approaches. More importantly, this Bayesian setting permits several interesting extensions:

(1) Explicit cue phrases for topic transitions are clearly relevant for segmentation, but cannot be handled by previous unsupervised methods. I'll show how such cue phrases can be discovered without annotation and incorporated to improve segmentation.

(2) Topic segmentation can be applied to multimedia data by searching for self-similarity in visual communication. I'll present a topic segmenter for conversational speech that integrates lexical and gestural cohesion.

(3) Hierarchical structure can be discovered by modeling lexical cohesion as a multi-scale phenomenon, in which some words are governed by low-level subtopics and others by high-level topics. Inference is performed jointly across scale levels, improving on greedy top-down approaches.

Overall, these extensions outperform the state of the art on several tasks, and point the way to more comprehensive analysis of discourse structure through hierarchical Bayesian models.

Bio

Jacob Eisenstein is a Beckman Postdoctoral Fellow at the University of Illinois. He completed his doctorate at MIT in 2008 under the supervision of Regina Barzilay and Randall Davis. His thesis, titled "Gesture in Automatic Discourse Processing," won the 2008 George M. Sprowls Award for best doctoral theses in Computer Science at MIT. Working in the domain of computational linguistics, Jacob focuses his research on applying state-of-the-art structured learning techniques to discourse processing and visual communication.

Venue, Date, and Time

Venue: Newell Simon Hall 1507

Date: Monday, April 13, 2009

Time: 12:00 noon