
   
Artificial Data

The previous section showed that WOLFIE successfully learns lexicons for a natural corpus and a realistic task. However, that demonstration involved only a relatively small corpus and a single representation formalism. We now show that our algorithm scales up well with more lexicon items to learn, more ambiguity, and more synonymy. These factors are difficult to control when using real data as input, and no large corpora annotated with semantic parses are available. We therefore present experimental results on an artificial corpus, in which both the sentences and their representations are completely artificial; the sentence representations are variable-free, as suggested by the work of Jackendoff (1990) and others. For each corpus discussed below, a random lexicon mapping words to simulated meanings was first constructed. This original lexicon was then used to generate a corpus of random utterances, each paired with a meaning representation. After using this corpus as input to WOLFIE, the learned lexicon was compared to the original lexicon, and weighted precision and weighted recall of the learned lexicon were measured. Precision measures the percentage of the lexicon entries (i.e., word-meaning pairs) learned by the system that are correct. Recall measures the percentage of the lexicon entries in the hand-built lexicon that are correctly learned by the system:

\begin{displaymath}
precision = \frac{\# \mbox{ correct pairs}}{\# \mbox{ pairs learned}}
\end{displaymath}

\begin{displaymath}
recall = \frac{\# \mbox{ correct pairs}}{\# \mbox{ pairs in hand-built lexicon}}.
\end{displaymath}
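To make the two definitions concrete, the following sketch computes them for lexicons represented as sets of word-meaning pairs; this encoding, and the toy entries shown, are our own illustrative assumptions rather than part of the actual evaluation code.

\begin{verbatim}
# Sketch: precision and recall of a learned lexicon against a reference
# lexicon, both modeled as sets of (word, meaning) pairs with meanings
# written as strings such as "f23(f2(f14))".

def precision_recall(learned, hand_built):
    correct = learned & hand_built                   # exactly matching pairs
    precision = len(correct) / len(learned) if learned else 0.0
    recall = len(correct) / len(hand_built) if hand_built else 0.0
    return precision, recall

# Toy example: one reference pair is learned correctly, one is learned
# with the wrong meaning, and one is missed entirely.
hand_built = {("w1", "f23(f2(f14))"), ("w2", "f10(A,f15(B))"), ("w3", "")}
learned    = {("w1", "f23(f2(f14))"), ("w2", "f10(A)")}
p, r = precision_recall(learned, hand_built)
print("precision=%.2f recall=%.2f" % (p, r))         # 0.50 and 0.33
\end{verbatim}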

To obtain weighted precision and recall, we weight the result for each pair by the word's frequency in the entire corpus (not just the training corpus). This models how likely we are to have learned the correct meaning of an arbitrarily chosen word in the corpus.

We generated several lexicons and associated corpora, varying the ambiguity rate (number of meanings per word) and the synonymy rate (number of words per meaning), as in Siskind (1996). Meaning representations were generated from a set of ``conceptual symbols'' that combine to form the meaning of each word; the number of conceptual symbols used in each lexicon will be noted as each corpus is described below. In each lexicon, 47.5% of the senses were variable-free, simulating noun-like meanings; 47.5% contained from one to three variables denoting open argument positions, simulating verb-like meanings; and the remaining 5% of words had the empty meaning, simulating function words. In addition, the functors in each meaning could have a depth of up to two and an arity of up to two. An example noun-like meaning is f23(f2(f14)), and an example verb-like meaning is f10(A,f15(B)); the conceptual symbols in this example are f23, f2, f14, f10, and f15. These multi-level meaning representations let us demonstrate the learning of more complex representations than those in the geography database domain: none of the hand-built meanings for phrases in that lexicon had functors embedded in arguments.

We used a grammar to generate utterances and their meanings from each original lexicon, with terminal categories selected using a distribution based on Zipf's Law [Zipf1949], under which the occurrence frequency of a word is inversely proportional to its rank by occurrence. We started with a baseline corpus generated from a lexicon of 100 words using 25 conceptual symbols and no ambiguity or synonymy; 1949 sentence-meaning pairs were generated, which we split into five training sets of 1700 sentences each. Figure 16 shows the weighted precision and recall curves for this initial test.
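The Zipf-distributed selection of words and the frequency weighting of the metrics can be sketched as follows. This is our own illustration, not the original corpus generator: the 100-word vocabulary, the corpus size, the use of random.choices, and the exact reading of the weighting (each pair counts in proportion to its word's corpus frequency) are all assumptions.

\begin{verbatim}
import random
from collections import Counter

# Sketch: draw word tokens with Zipfian frequencies (proportional to
# 1/rank) and derive the per-word weights for the weighted metrics.
vocab = ["w%d" % i for i in range(1, 101)]
zipf = [1.0 / rank for rank in range(1, 101)]
corpus = random.choices(vocab, weights=zipf, k=10000)
freq = Counter(corpus)                      # word -> corpus frequency

def weighted_precision_recall(learned, hand_built, freq):
    """One plausible reading of the weighting: each pair is weighted
    by its word's frequency in the entire corpus."""
    def mass(pairs):
        return sum(freq[word] for word, _ in pairs)
    correct = learned & hand_built
    wp = mass(correct) / mass(learned) if mass(learned) else 0.0
    wr = mass(correct) / mass(hand_built) if mass(hand_built) else 0.0
    return wp, wr
\end{verbatim}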
  
Figure 16: Baseline Artificial Corpus [artif-1.ps]

This demonstrates good scalability to a slightly larger corpus and lexicon than those of the U.S. geography query domain. A second corpus was generated from a second lexicon, again with 100 words and 25 conceptual symbols, but with the ambiguity increased to 1.25 meanings per word. This time, 1937 pairs were generated, and the corpus was split into five sets of 1700 training examples each. Weighted precision at 1650 examples drops to 65.4% from the previous level of 99.3%, and weighted recall drops to 58.9% from 99.3%. The full learning curve is shown in Figure 17.
  
Figure 17: A More Ambiguous Artificial Corpus [artif-2.ps]

A quick comparison to Siskind's performance on this corpus confirmed that his system achieved comparable performance, showing that, with current methods, this is close to the best performance obtainable on this more difficult corpus. One possible explanation for the smaller performance difference between the two systems on this corpus, compared to the geography domain, is that here the correct meaning of a word is not necessarily the most ``general,'' in terms of number of vertices, of all its candidate meanings. Therefore, the generality portion of the heuristic may negatively influence WOLFIE's performance in this domain. Finally, we show how performance changes with increasing ambiguity and increasing synonymy, holding the number of words and conceptual symbols constant. Figure 18 shows the weighted precision and recall with 1050 training examples at increasing levels of ambiguity, holding the synonymy level constant, and Figure 19 shows the results at increasing levels of synonymy, holding ambiguity constant. Increasing the level of synonymy does not affect the results as much as increasing the level of ambiguity, as we expected. Holding the corpus size constant while increasing the number of competing meanings for a word increases the number of candidate meanings created by WOLFIE and decreases the amount of evidence available for each meaning (e.g., the first component of the heuristic measure), making the learning task more difficult. Increasing the level of synonymy, on the other hand, does not have the same potential to mislead the learner.
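A toy simulation (our own, not part of WOLFIE) makes the asymmetry concrete: under ambiguity, a word's occurrences are split across competing candidate meanings, so the evidence for each candidate shrinks, whereas under synonymy each word still has a single, uncontested candidate that collects all of that word's occurrences.

\begin{verbatim}
import random
from collections import Counter

random.seed(0)
TOKENS = 100   # occurrences of one word (or one meaning) in a toy corpus

# Ambiguity 2.0: a single word has two gold meanings, so its occurrences
# are divided between two competing candidates for that word.
ambiguous_evidence = Counter(random.choice(["f1", "f2"]) for _ in range(TOKENS))
print("evidence per competing candidate:", dict(ambiguous_evidence))

# Synonymy 2.0: two words share one meaning; each word keeps a single
# candidate that receives all of that word's occurrences.
tokens_per_word = Counter(random.choice(["w1", "w2"]) for _ in range(TOKENS))
print("evidence per (sole) candidate of each synonym:", dict(tokens_per_word))
\end{verbatim}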
  
Figure 18: Increasing the Level of Ambiguity [ambig.ps]


  
Figure 19: Increasing the Level of Synonymy [synon.ps]

The number of training examples required to reach a given level of accuracy is also informative. Table 4 shows the point at which a standard (unweighted) precision of 75% was first reached for each level of ambiguity. Note, however, that we measured accuracy only after each set of 100 training examples, so the numbers in the table are approximate.
 
Table 4: Number of Examples to Reach 75% Precision

  Ambiguity Level    Number of Examples
  1.0                  150
  1.25                 450
  2.0                 1450
 

We performed a second test of scalability on two corpora generated from lexicons an order of magnitude larger than those in the above tests: each contains 1000 words built from 250 conceptual symbols. We generated one corpus from a lexicon with no ambiguity, and one from a lexicon with ambiguity and synonymy similar to that found in the WordNet database [Beckwith et al.1991], approximately 1.68 meanings per word and 1.3 words per meaning. These corpora contained 9904 (no ambiguity) and 9948 examples, respectively, and we split the data into five sets of 9000 training examples each. For the easier large corpus, the maximum average of weighted precision and recall was 85.6%, at 8100 training examples, while for the harder corpus the maximum average was 63.1%, at 8600 training examples.
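As a rough sketch of how a lexicon with such target rates could be constructed (our own construction, using atomic placeholder meanings rather than the structured terms described above), one can fix the total number of word-meaning pairs from the ambiguity target and derive the number of distinct meanings from the synonymy target:

\begin{verbatim}
import random

# Sketch: build a random word-to-meaning pairing whose average ambiguity
# (meanings per word) and synonymy (words per meaning) approximate given
# targets.  Assumes the ambiguity target is at least the synonymy target,
# so that n_meanings >= n_words below.
def random_lexicon(n_words=1000, ambiguity=1.68, synonymy=1.3, seed=0):
    rng = random.Random(seed)
    n_pairs = round(n_words * ambiguity)       # total word-meaning pairs
    n_meanings = round(n_pairs / synonymy)     # distinct meanings needed
    words = ["w%d" % i for i in range(n_words)]
    meanings = ["m%d" % i for i in range(n_meanings)]

    pairs = set()
    # Give every meaning one word; this also gives every word at least
    # one meaning, since n_meanings >= n_words.
    for i, m in enumerate(meanings):
        pairs.add((words[i % n_words], m))
    # Add random extra pairs until the target pair count is reached.
    while len(pairs) < n_pairs:
        pairs.add((rng.choice(words), rng.choice(meanings)))
    return pairs

lex = random_lexicon()
print(len(lex) / 1000.0,                            # observed ambiguity, ~1.68
      len(lex) / float(len({m for _, m in lex})))   # observed synonymy, ~1.30
\end{verbatim}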