Initial candidate meanings for a phrase are produced by computing the
maximally common substructure(s) between sampled pairs of representations of
sentences that contain it. We derive common
substructure by computing the Largest Isomorphic Connected Subgraphs
(LICS) of two labeled trees, taking labels into account in the
isomorphism. The analogous Largest Common Subgraph
problem [Garey & Johnson, 1979] is solvable in
polynomial time if, as we assume, both inputs are trees and if K,
the number of edges to include, is given.
Thus, we start with K set equal to the largest number of edges in the two
trees being compared, test for common subgraph(s), and iterate down to
K=1, stopping when one or more subgraphs are found for a given K.
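To make this downward search over K concrete, the following is a
brute-force Python sketch of the loop structure; the (label, children)
tree encoding, the function names, and the exhaustive enumeration of
connected subgraphs are our own illustration, not the polynomial-time
test used in the actual system.

from itertools import combinations

def index_tree(tree):
    """Flatten a (label, children) tree into label and adjacency maps."""
    labels, adj = {}, {}
    def walk(node, parent):
        nid = len(labels)
        labels[nid], children = node[0], node[1]
        adj[nid] = [] if parent is None else [parent]
        if parent is not None:
            adj[parent].append(nid)
        for child in children:
            walk(child, nid)
    walk(tree, None)
    return labels, adj

def canon(labels, adj, nodes, root, parent=None):
    """Canonical form of the subtree induced by `nodes`, rooted at `root`."""
    kids = sorted(canon(labels, adj, nodes, k, root)
                  for k in adj[root] if k in nodes and k != parent)
    return (labels[root], tuple(kids))

def unrooted_canon(labels, adj, nodes):
    """Root-independent canonical form: minimum over all choices of root."""
    return min(canon(labels, adj, nodes, r) for r in nodes)

def connected_subsets(labels, adj, size):
    """All connected node subsets of the given size (exponential sketch)."""
    for combo in combinations(sorted(labels), size):
        chosen, seen, stack = set(combo), set(), [combo[0]]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(m for m in adj[n] if m in chosen)
        if seen == chosen:
            yield chosen

def lics(t1, t2):
    """LICS of two labeled trees: start K at the largest possible common
    edge count and iterate downward, stopping at the first K for which
    one or more label-preserving common subgraphs exist."""
    (l1, a1), (l2, a2) = index_tree(t1), index_tree(t2)
    # A common subgraph has at most min(|t1|, |t2|) - 1 edges;
    # a subgraph with K edges spans k + 1 nodes.
    for k in range(min(len(l1), len(l2)) - 1, 0, -1):
        forms1 = {unrooted_canon(l1, a1, s)
                  for s in connected_subsets(l1, a1, k + 1)}
        common = {unrooted_canon(l2, a2, s)
                  for s in connected_subsets(l2, a2, k + 1)} & forms1
        if common:
            return common  # may contain several LICS of the same size
    return set()

# Two toy trees sharing the path answer - largest - state:
t1 = ("answer", [("largest", [("state", [])]), ("capital", [])])
t2 = ("answer", [("high_point", []), ("largest", [("state", [])])])
print(lics(t1, t2))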
For the Prolog query representation, the algorithm is complicated
somewhat by the presence of variables. We therefore augment LICS with
an addition similar to computing the Least General Generalization
(LGG) of first-order clauses [Plotkin, 1970]. The LGG of two sets of
literals is the least general set of literals that subsumes both. We
extend this by allowing that, when a term in the argument of a
literal is a conjunction, the algorithm tries all orderings of the
conjuncts during matching. Overall, our algorithm for finding the
LICS between two trees in the Prolog representation first finds the
common labeled edges and vertices as usual in LICS, but treats all
variables as equivalent. Then, it computes the Least General
Generalization, with conjunction taken into account, of the resulting
trees as converted back into literals. For example, given the two
trees:
answer(C, (largest(P, (state(S), population(S,P))), capital(S,C))).
answer(P, (high_point(S,P), largest(A, (state(S), area(S,A))))).
the common meaning is answer(_, largest(_, state(_))). Note that
the LICS of two trees may not be unique: there may be multiple
common subtrees containing the same maximal number of edges; in
this case, LICS returns multiple answers.
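The conjunction-ordering addition to LGG can be sketched in Python as
follows; the term encoding (compound terms as (functor, args) tuples,
with conjunctions flattened to an n-ary ','), the Var class, and the
symbol-counting generality measure are illustrative assumptions. Note
that the sketch applies the LGG step alone, without the LICS pruning
that precedes it in WOLFIE, so on the full example above it would
retain extra variable conjuncts.

from itertools import permutations

class Var:
    """Fresh generalization variable introduced by anti-unification."""
    count = 0
    def __init__(self):
        Var.count += 1
        self.name = f"_G{Var.count}"
    def __repr__(self):
        return self.name

def lgg(s, t, table):
    """Plotkin-style LGG of two terms; compound terms are (functor, args)
    tuples, conjunctions are flattened as (',', conjuncts)."""
    if s == t:
        return s
    if (isinstance(s, tuple) and isinstance(t, tuple)
            and s[0] == t[0] and len(s[1]) == len(t[1])):
        if s[0] == ',':  # conjunction: try every ordering, keep the best
            return best_conjunction(s, t, table)
        return (s[0], tuple(lgg(a, b, table) for a, b in zip(s[1], t[1])))
    if (s, t) not in table:      # same pair of subterms -> same variable,
        table[(s, t)] = Var()    # so co-references are preserved
    return table[(s, t)]

def size(t):
    """Number of non-variable symbols; larger means less general."""
    if isinstance(t, tuple):
        return 1 + sum(size(a) for a in t[1])
    return 0 if isinstance(t, Var) else 1

def best_conjunction(s, t, table):
    """Match the conjuncts of t against those of s in every order and
    keep the least general (largest) result."""
    candidates = ((',', tuple(lgg(a, b, table) for a, b in zip(s[1], perm)))
                  for perm in permutations(t[1]))
    return max(candidates, key=size)

# lgg(p(a, (q(X), r(X))), p(b, (r(Y), q(Y)))) = p(_, (q(_), r(_))),
# with the two new variables co-referring thanks to the table.
s = ("p", ("a", (",", (("q", ("X",)), ("r", ("X",))))))
t = ("p", ("b", (",", (("r", ("Y",)), ("q", ("Y",))))))
print(lgg(s, t, {}))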
Table 1: Sample Candidate Lexical Entries and their Derivation
The sets of initial candidate meanings for some of the
phrases in the sample corpus are shown in Table 1.
While in this example we show the LICS for all pairs of sentences in
which a phrase appears, the actual algorithm randomly samples a
subset of the pairs for efficiency, as in GOLEM
[Muggleton & Feng, 1990].
For phrases appearing in only one
sentence (e.g., "encuentra"), the entire sentence representation
(excluding the database constant given as background knowledge)
is used as an initial candidate meaning. Such candidates are
typically generalized in step 2.2 of the algorithm to only the correct
portion of the representation before they are added to the lexicon; we
will see an example of this below.
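A schematic Python sketch of this candidate-generation step follows;
the function and parameter names are hypothetical, common_meaning
stands in for the LICS/LGG routine described above, and the exclusion
of the background-knowledge database constant is omitted.

import random
from itertools import combinations

def initial_candidates(occurrences, common_meaning, max_pairs=10, seed=None):
    """Initial candidate meanings for one phrase.

    occurrences    -- representations (hashable, e.g. the tuple encoding
                      above) of the sentences containing the phrase
    common_meaning -- function returning the set of common substructures
                      of two representations (the LICS/LGG routine)
    max_pairs      -- budget for the GOLEM-style random sampling of pairs
    """
    if len(occurrences) == 1:
        # A phrase seen only once: take the whole sentence representation;
        # step 2.2 of the algorithm may later generalize it to only the
        # correct portion before it enters the lexicon.
        return set(occurrences)
    pairs = list(combinations(occurrences, 2))
    rng = random.Random(seed)
    sampled = rng.sample(pairs, min(max_pairs, len(pairs)))
    candidates = set()
    for r1, r2 in sampled:
        candidates |= common_meaning(r1, r2)
    return candidates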