Next: Active Learning
Up: Related Work
Previous: Related Work
Work on automated lexicon and language acquisition dates back to
Siklossy (1972), who demonstrated a system that learned
transformation patterns from logic back to natural language. As
already noted, the most closely related work
is that of Jeff Siskind, which we described
briefly in Section 2 and whose system we ran comparisons to in
Section 5. Our definition of the learning problem can
be compared to his ``mapping problem'' [Siskind1993]. That
formulation differs from ours in several respects. First, his sentence
representations are terms instead of trees. However, as shown in
Figure 7, terms can also be represented as trees
that conform to our formalism with some minor additions. Next, his
notion of interpretation does involve a type of tree, but carries the
entire representation of a sentence up to the root. Also, it is not
clear how he would handle quantified variables in the representation
of sentences. Skolemization is possible, but then generalization
across sentences would require special handling. We make the
single-use assumption and he does not. Another difference is our bias
towards a minimal number of lexicon entries, while he attempts to find
a monosemous lexicon. His later work [Siskind2000]
relaxes this to allow ambiguity and noise, but still biases towards
minimizing ambiguity. However, his formal definition does not
explicitly allow lexical ambiguity, but handles it in a heuristic
manner. This, though, may lead to more robustness than our method in
the face of noise. Finally, our definition allows phrasal lexicon
entries.
Siskind's work on this topic has explored many different variations
along a continuum of using many constraints but requiring more time
to incorporate each new example [Siskind1993], versus few
constraints but requiring more training data [Siskind1996].
Thus, perhaps his earlier systems would have been able to learn the
lexicons of Section 5 more quickly; but crucially those
systems did not allow lexical ambiguity, and thus also may not have
learned as accurate a lexicon. More detailed comparisons to such
versions of the system are outside the scope of this paper. Our goal
with WOLFIE is to learn a possibly ambiguous lexicon from as few
examples as possible, and we thus made comparisons along this
dimension alone.
Siskind's approach, like ours, takes into account constraints between
word meanings that are justified by the exclusivity and compositionality
assumptions. His approach is somewhat more general in that it
handles noise and referential uncertainty (uncertainty about the
meaning of a sentence and thus multiple possible candidates), while
ours is specialized for applications where the meaning (or meanings)
is known. The experimental results in Section 5
demonstrate the advantage of our method for such an application.
He has demonstrated his system to be capable of learning
reasonably accurate lexicons from large, ambiguous, and noisy
artificial corpora, but this accuracy is only assured if the learning
algorithm converges, which did not occur for our smaller corpus in the
experiments we ran.
Also, as already noted, his system operates in an
incremental or on-line fashion, discarding each sentence as it
processes it, while ours is batch. In addition, his search for word
meanings proceeds in two stages, as discussed in
Section 2.2. By using common substructures, we combine
these two stages in WOLFIE. Both systems do have greedy aspects,
ours in the choice of the next best lexical entry, his in the choice
to discard utterances as noise or create a homonymous lexical entry.
Finally, his system does not compute statistical correlations between
words and their possible meanings, while ours does.
Besides Siskind's work, there are others who approach the problem from
a cognitive perspective. For example, DeMarcken (1994)
also uses child language learning as a motivation, but approaches the
segmentation problem instead of the learning of semantics. For training
input, he uses a flat list of tokens for semantic representations, but
does not segment sentences into words. He uses a variant of
expectation-maximization [Dempster et al.1977], together with a form
of parsing and dictionary matching techniques, to segment the
sentences and associate the segments with their most likely meaning.
On the Childes corpus, the algorithm achieves very high precision, but
recall is not provided.
Others taking the cognitive approach demonstrate language
understanding by the ability to carry out some task such as parsing.
For example, Nenov and Dyer (1994) describe a neural network model to
map between visual and verbal-motor commands, and
Colunga (1998) use neural network modeling techniques for
learning spatial concepts. Feldman and his colleagues at Berkeley
[Feldman et al.1995] are actively pursuing cognitive models of the
acquisition of semantic concepts. Another Berkeley effort, the system
by Regier (1996) is given examples of pictures paired with
natural language descriptions that apply to the picture, and learns to
judge whether a new sentence is true of a given picture.
Similar work by Suppes (1991)
uses robots to demonstrate lexicon learning.
A robot is trained on
cognitive and perceptual concepts and their associated actions, and
learns to execute simple commands. Along similar lines,
Tishby and Gorin (1994) have a system that learns associations between
words and actions, but they use a statistical framework to learn these
associations, and do not handle structured representations.
Similarly, Oates et al. (1999) discuss the acquisition of lexical
hierarchies and their associated meaning as defined by the sensory
environment of a robot.
The problem of automatic construction of translation lexicons
[Smadja et al.1996,Melamed1995,Wu Xia1995,Kumano Hirakawa1994,Catizone et al.1993,Gale Church1991,Brown et al.1990]
has a definition similar to our own.
While most of these methods also compute association scores between
pairs (in their case, word-word pairs) and use a greedy algorithm to
choose the best translation(s) for each word, they do not take
advantage of the constraints between pairs. One exception is
Melamed (2000); however, his approach does not allow for phrases
in the lexicon or for synonymy within one text segment, while ours
does. Also, Yamazaki et al. (1995) learn both translation rules and
semantic hierarchies from parsed parallel sentences in Japanese and
English. Of course, the main difference between this body of work and
this paper is that we map words to semantic structures, not
to other words.
As mentioned in the introduction,
there is also a large body of work on learning lexical
semantics but using different problem formulations than our own. For
example,
Collins (199), Riloff (1999), Roark (1998), and Schneider (1998)
define semantic lexicons as a grouping of words into semantic
categories, and in the latter case, add relational information.
The result is typically applied as a semantic lexicon for information
extraction or entity tagging.
Pedersen and Chen (1995) describe a method for acquiring syntactic and
semantic features of an unknown word, assuming access to an initial
concept hierarchy, but they give no experimental results. Many
systems
[Fukumoto Tsujii1995,Haruno1995,Johnston et al.1995,Webster Marcus1995] focus
only on acquisition of verbs or nouns, rather than all types of words.
Also, the authors just named either do not experimentally evaluate
their systems, or do not show the usefulness of the learned lexicons
for a specific application.
Several authors
[Rooth et al.1999,Collins1997,Ribas1994,Manning1993,Resnik1993,Brent1991]
discuss the acquisition of subcategorization information for verbs,
and others describe work on learning selectional restrictions
[Manning1993,Brent1991].
Both of these
are different from the information required for mapping to semantic
representation, but could be useful as a source of information to
further constrain the search. Li (1998) further expands on
the subcategorization work by inducing clustering information.
Finally, several systems
[Knight1996,Hastings1996,Russell1993] learn new
words from context, assuming that a large initial lexicon and parsing
system are already available.
Another related body of work is grammar acquisition, especially those
areas that tightly integrate the grammar with a lexicon, such as with
Categorial Grammars [Retore Bonato2001,Dudau-Sofronie et al.2001,Watkinson Manandhar1999].
The theory of Categorial Grammar also has ties with lexical semantics,
but these semantics have not often been used for inference in support
of high-level tasks such as database retrieval. While learning syntax
and semantics together is arguably a more difficult task, the
aforementioned work has not been evaluated on large corpora,
presumably primarily due to the difficulty of annotation.
Next: Active Learning
Up: Related Work
Previous: Related Work
Cindi Thompson
2003-01-02