A Robust Parsing Algorithm for Link Grammars
Abstract
In this paper we present a robust parsing algorithm based on the link
grammar formalism for parsing natural languages. Our algorithm is a
natural extension of the original dynamic programming recognition
algorithm which recursively counts the number of linkages between two
words in the input sentence. The modified algorithm uses the notion of
a null link in order to allow a connection between any pair of
adjacent words, regardless of their dictionary definitions. The
algorithm proceeds by making three dynamic programming passes. In the
first pass, the input is parsed using the original algorithm which
enforces the constraints on links to ensure grammaticality. In the
second pass, the total cost of each substring of words is
computed, where cost is determined by the number of null links necessary
to parse the substring. The final pass counts the total number of
parses with minimal cost. All of the original pruning techniques have
natural counterparts in the robust algorithm. When used together with
memoization, these techniques enable the algorithm to run efficiently
with cubic worst-case complexity.
We have implemented these ideas and tested them by parsing the
Switchboard corpus of conversational English. This corpus is
comprised of approximately three million words of text, corresponding
to more than 150 hours of transcribed speech collected from telephone
conversations restricted to 70 different topics. Although only a
small fraction of the sentences in this corpus are ``grammatical'' by
standard criteria, the robust link grammar parser is able to extract relevant
structure for a large portion of the sentences. We present the
results of our experiments using this system, including the analyses
of selected and random sentences from the corpus.
We placed a version of the robust parser on the Word Wide Web for
experimentation. It can be reached at URL
http://bobo.link.cs.cmu.edu/cgi-bin/grammar/construct-page.cgi.
In this version there are some limitations such as the maximum length of
a sentence in words and the maximum amount of memory the parser can use.