Lexicon Acquisition

Next: Active Learning Up: Related Work Previous: Related Work

Lexicon Acquisition

Work on automated lexicon and language acquisition dates back to Siklossy (1972), who demonstrated a system that learned transformation patterns from logic back to natural language. As already noted, the most closely related work is that of Jeff Siskind, which we described briefly in Section 2 and whose system we ran comparisons to in Section 5. Our definition of the learning problem can be compared to his ``mapping problem'' [Siskind1993]. That formulation differs from ours in several respects. First, his sentence representations are terms instead of trees. However, as shown in Figure 7, terms can also be represented as trees that conform to our formalism with some minor additions. Next, his notion of interpretation does involve a type of tree, but carries the entire representation of a sentence up to the root. Also, it is not clear how he would handle quantified variables in the representation of sentences. Skolemization is possible, but then generalization across sentences would require special handling. We make the single-use assumption and he does not. Another difference is our bias towards a minimal number of lexicon entries, while he attempts to find a monosemous lexicon. His later work [Siskind2000] relaxes this to allow ambiguity and noise, but still biases towards minimizing ambiguity. However, his formal definition does not explicitly allow lexical ambiguity, but handles it in a heuristic manner. This, though, may lead to more robustness than our method in the face of noise. Finally, our definition allows phrasal lexicon entries. Siskind's work on this topic has explored many different variations along a continuum of using many constraints but requiring more time to incorporate each new example [Siskind1993], versus few constraints but requiring more training data [Siskind1996]. Thus, perhaps his earlier systems would have been able to learn the lexicons of Section 5 more quickly; but crucially those systems did not allow lexical ambiguity, and thus also may not have learned as accurate a lexicon. More detailed comparisons to such versions of the system are outside the scope of this paper. Our goal with WOLFIE is to learn a possibly ambiguous lexicon from as few examples as possible, and we thus made comparisons along this dimension alone. Siskind's approach, like ours, takes into account constraints between word meanings that are justified by the exclusivity and compositionality assumptions. His approach is somewhat more general in that it handles noise and referential uncertainty (uncertainty about the meaning of a sentence and thus multiple possible candidates), while ours is specialized for applications where the meaning (or meanings) is known. The experimental results in Section 5 demonstrate the advantage of our method for such an application. He has demonstrated his system to be capable of learning reasonably accurate lexicons from large, ambiguous, and noisy artificial corpora, but this accuracy is only assured if the learning algorithm converges, which did not occur for our smaller corpus in the experiments we ran. Also, as already noted, his system operates in an incremental or on-line fashion, discarding each sentence as it processes it, while ours is batch. In addition, his search for word meanings proceeds in two stages, as discussed in Section 2.2. By using common substructures, we combine these two stages in WOLFIE. Both systems do have greedy aspects, ours in the choice of the next best lexical entry, his in the choice to discard utterances as noise or create a homonymous lexical entry. Finally, his system does not compute statistical correlations between words and their possible meanings, while ours does. Besides Siskind's work, there are others who approach the problem from a cognitive perspective. For example, DeMarcken (1994) also uses child language learning as a motivation, but approaches the segmentation problem instead of the learning of semantics. For training input, he uses a flat list of tokens for semantic representations, but does not segment sentences into words. He uses a variant of expectation-maximization [Dempster et al.1977], together with a form of parsing and dictionary matching techniques, to segment the sentences and associate the segments with their most likely meaning. On the Childes corpus, the algorithm achieves very high precision, but recall is not provided. Others taking the cognitive approach demonstrate language understanding by the ability to carry out some task such as parsing. For example, Nenov and Dyer (1994) describe a neural network model to map between visual and verbal-motor commands, and Colunga (1998) use neural network modeling techniques for learning spatial concepts. Feldman and his colleagues at Berkeley [Feldman et al.1995] are actively pursuing cognitive models of the acquisition of semantic concepts. Another Berkeley effort, the system by Regier (1996) is given examples of pictures paired with natural language descriptions that apply to the picture, and learns to judge whether a new sentence is true of a given picture. Similar work by Suppes (1991) uses robots to demonstrate lexicon learning. A robot is trained on cognitive and perceptual concepts and their associated actions, and learns to execute simple commands. Along similar lines, Tishby and Gorin (1994) have a system that learns associations between words and actions, but they use a statistical framework to learn these associations, and do not handle structured representations. Similarly, Oates et al. (1999) discuss the acquisition of lexical hierarchies and their associated meaning as defined by the sensory environment of a robot. The problem of automatic construction of translation lexicons [Smadja et al.1996,Melamed1995,Wu Xia1995,Kumano Hirakawa1994,Catizone et al.1993,Gale Church1991,Brown et al.1990] has a definition similar to our own. While most of these methods also compute association scores between pairs (in their case, word-word pairs) and use a greedy algorithm to choose the best translation(s) for each word, they do not take advantage of the constraints between pairs. One exception is Melamed (2000); however, his approach does not allow for phrases in the lexicon or for synonymy within one text segment, while ours does. Also, Yamazaki et al. (1995) learn both translation rules and semantic hierarchies from parsed parallel sentences in Japanese and English. Of course, the main difference between this body of work and this paper is that we map words to semantic structures, not to other words. As mentioned in the introduction, there is also a large body of work on learning lexical semantics but using different problem formulations than our own. For example, Collins (199), Riloff (1999), Roark (1998), and Schneider (1998) define semantic lexicons as a grouping of words into semantic categories, and in the latter case, add relational information. The result is typically applied as a semantic lexicon for information extraction or entity tagging. Pedersen and Chen (1995) describe a method for acquiring syntactic and semantic features of an unknown word, assuming access to an initial concept hierarchy, but they give no experimental results. Many systems [Fukumoto Tsujii1995,Haruno1995,Johnston et al.1995,Webster Marcus1995] focus only on acquisition of verbs or nouns, rather than all types of words. Also, the authors just named either do not experimentally evaluate their systems, or do not show the usefulness of the learned lexicons for a specific application. Several authors [Rooth et al.1999,Collins1997,Ribas1994,Manning1993,Resnik1993,Brent1991] discuss the acquisition of subcategorization information for verbs, and others describe work on learning selectional restrictions [Manning1993,Brent1991]. Both of these are different from the information required for mapping to semantic representation, but could be useful as a source of information to further constrain the search. Li (1998) further expands on the subcategorization work by inducing clustering information. Finally, several systems [Knight1996,Hastings1996,Russell1993] learn new words from context, assuming that a large initial lexicon and parsing system are already available. Another related body of work is grammar acquisition, especially those areas that tightly integrate the grammar with a lexicon, such as with Categorial Grammars [Retore Bonato2001,Dudau-Sofronie et al.2001,Watkinson Manandhar1999]. The theory of Categorial Grammar also has ties with lexical semantics, but these semantics have not often been used for inference in support of high-level tasks such as database retrieval. While learning syntax and semantics together is arguably a more difficult task, the aforementioned work has not been evaluated on large corpora, presumably primarily due to the difficulty of annotation.

Next: Active Learning Up: Related Work Previous: Related Work

Cindi Thompson
2003-01-02