The previous section showed that WOLFIE successfully learns lexicons
for a natural corpus and a realistic task. However, this demonstrates success on only a
relatively small corpus and with one representation formalism. We now
show that our algorithm scales up well with more lexicon items to
learn, more ambiguity, and more synonymy. These factors are difficult to
control when using real data as input. Also, there are no large
corpora available that are annotated with semantic parses. We
therefore present experimental results on an artificial corpus.
In this corpus, both the sentences and their representations are
completely artificial, and the sentence representations are
variable-free, as suggested by the work of Jackendoff (1990) and others.
For each corpus discussed below, a random lexicon mapping words to
simulated meanings was first constructed. This original lexicon was then used
to generate a corpus of random utterances each paired with a meaning
representation. After using this corpus as input to
WOLFIE, the
learned lexicon was compared to the original lexicon, and
weighted precision and weighted recall of the learned lexicon
were measured. Precision measures the
percentage of the lexicon entries (i.e., word-meaning pairs) learned by
the system that are correct. Recall measures the percentage of the lexicon
entries in the hand-built lexicon that are correctly learned by the system:

    precision = (# correct pairs learned) / (# pairs learned)
    recall    = (# correct pairs learned) / (# pairs in the hand-built lexicon)
To get weighted precision and recall measures, we then weight the
results for each pair by the word's frequency in the entire corpus
(not just the training corpus). This models how likely we are to have
learned the correct meaning for an arbitrarily chosen word in the
corpus.
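As a concrete illustration, the following sketch computes weighted precision
and recall under one plausible reading of the weighting scheme; the lexicon
representation (words mapped to sets of meaning strings) and all names here
are our own assumptions, not WOLFIE's internal format.

from collections import Counter

def weighted_precision_recall(learned, original, word_freq):
    """Weighted precision/recall over word-meaning pairs.

    learned, original: dicts mapping each word to a set of meaning strings
    (a hypothetical representation for illustration only).
    word_freq: Counter of word occurrences over the entire corpus.
    Each pair is weighted by its word's frequency in the corpus.
    """
    correct_w = learned_w = original_w = 0.0
    for word, meanings in learned.items():
        learned_w += word_freq[word] * len(meanings)
        correct_w += word_freq[word] * len(meanings & original.get(word, set()))
    for word, meanings in original.items():
        original_w += word_freq[word] * len(meanings)
    precision = correct_w / learned_w if learned_w else 0.0
    recall = correct_w / original_w if original_w else 0.0
    return precision, recall

# Toy usage: a two-word lexicon in which one meaning is learned incorrectly.
orig_lex = {"river": {"f3(f7)"}, "flows": {"f10(A,f15(B))"}}
learned_lex = {"river": {"f3(f7)"}, "flows": {"f10(A)"}}
freq = Counter({"river": 40, "flows": 10})
print(weighted_precision_recall(learned_lex, orig_lex, freq))  # (0.8, 0.8)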
We generated several lexicons and associated corpora, varying the
ambiguity rate (number of meanings per word) and synonymy rate
(number of words per meaning), as in Siskind (1996). Meaning representations
were generated using a set of ``conceptual symbols''
that combined to form the meaning for each word. The number of
conceptual symbols used in each lexicon will be noted when we describe
each corpus below.
In each lexicon, 47.5% of the senses were variable-free to simulate
noun-like meanings, and 47.5% contained from one to three variables to denote
open argument positions to simulate verb-like meanings. The remainder
of the words (the remaining 5%) had the empty meaning to simulate
function words. In addition, the functors in each meaning could have
a depth of up to two and an arity of up to two. An example noun-like
meaning is f23(f2(f14)), and an example verb-like meaning is f10(A,f15(B));
the conceptual symbols in this example are f23, f2,
f14, f10, and f15. By using these multi-level meaning
representations we demonstrate the learning of more complex
representations than those in the geography database domain: none of
the hand-built meanings for phrases in that lexicon had functors
embedded in arguments. We used a grammar to generate utterances and
their meanings from each original lexicon, with terminal categories
selected using a distribution based on Zipf's Law
[Zipf1949]. Under Zipf's Law, the occurrence frequency of a
word is inversely proportional to its ranking by occurrence.
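To make the generation setup more concrete, here is a minimal sketch of how
one might draw random word senses with the stated proportions and sample
words with Zipfian frequencies. It omits the grammar that produced whole
utterances and their meanings, and every name and parameter below is an
assumption made for illustration.

import random

CONCEPTUAL_SYMBOLS = [f"f{i}" for i in range(25)]    # 25 conceptual symbols

def random_noun_term(depth):
    """Random variable-free term with functor depth <= depth, arity <= 2."""
    head = random.choice(CONCEPTUAL_SYMBOLS)
    if depth == 0 or random.random() < 0.4:
        return head
    args = [random_noun_term(depth - 1) for _ in range(random.randint(1, 2))]
    return f"{head}({','.join(args)})"

def random_sense():
    """Draw one word sense: 47.5% noun-like, 47.5% verb-like, 5% empty."""
    r = random.random()
    if r < 0.475:                                    # noun-like: variable-free
        return random_noun_term(2)
    if r < 0.95:                                     # verb-like: 1-3 open argument positions
        n_vars = random.randint(1, 3)
        vars_ = [chr(ord('A') + i) for i in range(n_vars)]
        head = random.choice(CONCEPTUAL_SYMBOLS)
        if n_vars == 1:
            return f"{head}({vars_[0]})"
        # Nest the extra variables so arity stays <= 2, e.g. f10(A,f15(B,C)).
        nested = f"{random.choice(CONCEPTUAL_SYMBOLS)}({','.join(vars_[1:])})"
        return f"{head}({vars_[0]},{nested})"
    return ""                                        # function word: empty meaning

# Build a 100-word lexicon, then sample words with Zipfian frequencies
# (the probability of a word is inversely proportional to its rank).
lexicon = {f"w{i}": random_sense() for i in range(100)}
words = list(lexicon)
zipf_weights = [1.0 / (rank + 1) for rank in range(len(words))]
sample = random.choices(words, weights=zipf_weights, k=10)
print([(w, lexicon[w]) for w in sample])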
We started with a baseline corpus generated from a lexicon of 100
words using 25 conceptual symbols and no ambiguity or synonymy; 1949
sentence-meaning pairs were generated. We split this into five training
sets of 1700 sentences each.
Figure 16 shows the weighted precision and recall curves
for this initial test.
Figure 16: Baseline Artificial Corpus
This demonstrates good scalability to a slightly larger corpus and lexicon than
that of the U.S. geography query domain.
A second corpus was generated from a second lexicon, also of 100 words
using 25 conceptual symbols, but increasing the ambiguity to 1.25
meanings per word. This time, 1937 pairs were generated and the
corpus split into five sets of 1700 training examples each. Weighted
precision at 1650 examples drops to 65.4% from the previous level of 99.3%, and weighted
recall to 58.9% from 99.3%. The full learning curve is shown in
Figure 17.
Figure 17: A More Ambiguous Artificial Corpus
A comparison to Siskind's system on this corpus confirmed that it
achieved comparable performance, suggesting that this is close to the
best performance obtainable with current methods on this more
difficult corpus.
One possible explanation for the smaller performance difference
between the two systems on this corpus versus the geography domain is
that in this domain, the correct meaning for a word is not necessarily
the most ``general,'' in terms of number of vertices, of all its
candidate meanings. Therefore, the generality portion of the
heuristic may negatively influence the performance of WOLFIE in this
domain.
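To make "number of vertices" concrete, here is a minimal sketch that counts
the vertices of a meaning, assuming meanings are stored as strings like those
above and that every conceptual symbol and variable counts as one vertex of
the term's tree; how WOLFIE's heuristic actually uses this count is described
with the algorithm itself.

import re

def vertex_count(meaning):
    """Count vertices of a meaning term such as "f23(f2(f14))" or
    "f10(A,f15(B))", treating each conceptual symbol and each variable
    as one vertex of the term's tree (an illustrative reading only)."""
    if not meaning:                        # empty meaning for function words
        return 0
    return len(re.findall(r"[A-Za-z]\w*", meaning))

print(vertex_count("f23(f2(f14))"))        # 3 vertices
print(vertex_count("f10(A,f15(B))"))       # 4 vertices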
Finally, we show the change in performance with increasing
ambiguity and increasing synonymy, holding the number of words and
conceptual symbols constant. Figure 18
shows the weighted precision and recall with 1050 training examples
for increasing levels of ambiguity, holding the synonymy
level constant. Figure 19 shows the results
at increasing levels of synonymy, holding ambiguity constant.
Increasing the level of synonymy does not affect the results
as much as increasing the level of ambiguity, as we expected.
Holding the corpus size constant while increasing the number of
competing meanings for a word increases the number of candidate
meanings created by WOLFIE and decreases the amount of evidence
available for each meaning (e.g., the first component of the heuristic
measure), making the learning task more difficult. On the other hand,
increasing the level of synonymy does not have the same potential to
mislead the learner.
Figure 18: Increasing the Level of Ambiguity
Figure 19: Increasing the Level of Synonymy
The number of training examples required to reach a certain level of
accuracy is also informative. In Table 4, we show
the point at which a standard precision of 75% was first reached for
each level of ambiguity. Note, however, that we measured accuracy only
after every 100 training examples, so the numbers in the table are
approximate.
Table 4: Number of Examples to Reach 75% Precision

    Ambiguity Level    Number of Examples
    1.0                  150
    1.25                 450
    2.0                 1450
We performed a second test of scalability on two corpora generated
from lexicons an order of magnitude larger than those in the above
tests. In these tests, each lexicon contained 1000 words built from
250 conceptual symbols. We generated one corpus with no ambiguity, and
one from a lexicon with ambiguity and synonymy similar to that found
in the WordNet database [Beckwith et al.1991]; the ambiguity in
WordNet is approximately 1.68 meanings per word and the synonymy 1.3
words per meaning. These
corpora contained 9904 (no ambiguity) and 9948 examples, respectively,
and we split the data into five sets of 9000 training examples each.
For the easier large corpus, the maximum average of weighted precision
and recall was 85.6%, at 8100 training examples, while for the harder
corpus, the maximum average was 63.1% at 8600 training examples.