A Database Query Application

This section describes our experimental results on a database query application. The first corpus discussed contains 250 questions about U.S. geography, each paired with a Prolog query that extracts its answer from a database. This domain was originally chosen because of the availability of GEOBASE, a hand-built natural language interface to a database containing about 800 facts. GEOBASE was supplied with Turbo Prolog 2.0 [Borland International, 1988] and was designed specifically for this domain. The questions in the corpus were collected by asking undergraduate students to generate English questions for this database; they were given only cursory knowledge of its contents and no opportunity to use it. To broaden the test, we had the same 250 sentences translated into Spanish, Turkish, and Japanese; the Japanese translations are in word-segmented Roman orthography. Each translated question was paired with the appropriate logical query from the English corpus.

To evaluate the learned lexicons, we measured their utility as background knowledge for CHILL. We do this by choosing a random set of 25 test examples and then learning lexicons and parsers from increasingly larger subsets of the remaining 225 examples, increasing by 50 examples each time. After training, the test examples are parsed with the learned parser, the resulting queries are submitted to the database, the answers are compared to those produced by submitting the correct representations, and the percentage of correct (matching) answers is recorded. By using the difficult ``gold standard'' of retrieving a correct answer, we avoid measures of partial accuracy that we believe do not adequately reflect final utility. We repeated this process for ten different random training and test sets and evaluated performance differences using a two-tailed, paired t-test with a significance level of $p \leq 0.05$.

We compared our system to the incremental (on-line) lexicon learner developed by Siskind (1996). To make a more equitable comparison to our batch algorithm, we ran his system in a ``simulated'' batch mode by presenting the corpus 500 times, analogous to running 500 epochs when training a neural network. While this does not add new kinds of data to learn from, it allows his algorithm to perform inter-sentential inference in both directions over the corpus instead of just one. Our point here is to compare accuracy over the same size training corpus, a metric Siskind's system was not optimized for. We are not concerned with the difference in execution time here, and in preliminary experiments with this data the lexicons learned by running Siskind's system in incremental mode (presenting the corpus a single time) resulted in substantially lower performance. We also disabled WOLFIE's ability to learn phrases of more than one word, since the current version of Siskind's system cannot learn such phrases. Finally, we made comparisons to the parsers learned by CHILL when given a hand-coded lexicon as background knowledge.

In this and similar applications, there are many terms, such as state and city names, whose meanings can be extracted automatically from the database. Therefore, all tests below were run with such names given to the learner as an initial lexicon; this is helpful but not required. Section 5.2 gives results for a different task with no such initial lexicon.
However, unless otherwise noted, for all tests in this section (5.1) we did not strip sentences of phrases known to have empty meanings, unlike in the example of Section 4.
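
To make the evaluation protocol above concrete, the following is a minimal Python sketch of one train/test trial. It is illustrative only: the functions train_wolfie, train_chill, and execute_query are hypothetical placeholders for WOLFIE, CHILL, and the query database (they are passed in as parameters rather than taken from any real API), and the sample question/query pair is not drawn from the actual corpus.

    import random
    from scipy.stats import ttest_rel  # two-tailed, paired t-test

    TRAIN_SIZES = [50, 100, 150, 200, 225]   # training-set sizes, as in the text
    TEST_SIZE = 25                           # held-out test examples per trial

    # One illustrative (question, Prolog query) pair; actual corpus items may differ.
    EXAMPLE_PAIR = ("What is the capital of Texas?",
                    "answer(C, (capital(S,C), const(S, stateid(texas))))")

    def run_trial(corpus, initial_lexicon, database, seed,
                  train_wolfie, train_chill, execute_query):
        """One random split: % of test questions whose retrieved answer matches."""
        rng = random.Random(seed)
        examples = list(corpus)              # list of (question, gold_query) pairs
        rng.shuffle(examples)
        test, pool = examples[:TEST_SIZE], examples[TEST_SIZE:]
        scores = {}
        for n in TRAIN_SIZES:
            lexicon = train_wolfie(pool[:n], initial_lexicon)  # learned lexicon
            parser = train_chill(pool[:n], lexicon)            # lexicon as background knowledge
            correct = 0
            for question, gold_query in test:
                predicted = parser(question)
                # "Gold standard": a parse counts only if it retrieves the correct answer.
                if execute_query(database, predicted) == execute_query(database, gold_query):
                    correct += 1
            scores[n] = 100.0 * correct / len(test)
        return scores

    def significantly_different(scores_a, scores_b, alpha=0.05):
        """Two-tailed, paired t-test over the per-trial scores of two systems."""
        _, p_value = ttest_rel(scores_a, scores_b)
        return p_value <= alpha

Repeating run_trial with ten different seeds and applying significantly_different to the resulting per-size scores corresponds to the ten random splits and the paired t-test with $p \leq 0.05$ described above.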

 