In order to evaluate our approach, we need to assess how well the automatically
learned ontologies reflect a given domain. One possibility would be to compute
how many of the superconcept relations in the automatically learned ontology
are correct, as done for example by [33] or [8]. However, since our approach, like many others (compare [34,46,30]), does not produce appropriate names for the abstract concepts it generates,
it seems difficult to assess the validity of a given superconcept relation.
Another possibility is to compute how 'similar' the automatically learned concept
hierarchy is to a given hierarchy for the domain in question. Here
the crucial question is how to define similarity between concept hierarchies.
Though there is a great amount of work in the AI community on how to compute the
similarity between trees [63,28], concept lattices [4], conceptual graphs [42,45] and (plain) graphs [11,64], it is not clear how these
similarity measures translate to concept hierarchies. An interesting work along
these lines is the one presented by [40], in which ontologies are
compared along different levels: semiotic, syntactic and pragmatic. In particular,
the authors present measures to compare the lexical and taxonomic overlap
between two ontologies.
Furthermore, they also present an interesting study in which different subjects
were asked to model a tourism ontology. The resulting ontologies are compared in terms
of the defined similarity measures, thus quantifying the agreement of different
subjects on the task of modeling an ontology.
In order to formally define our evaluation measures, we introduce a
core ontology model in line with the ontological model presented by
[59]:
Definition 6 (Core Ontology)
A core ontology is a structure $O := (C, root, \leq_C)$ consisting of (i)
a set $C$ of concept identifiers, (ii) a designated root element
representing the top element of the (iii) partial order $\leq_C$
on $C \cup \{root\}$ such that $\forall c \in C: c \leq_C root$,
called concept hierarchy or taxonomy.
For the sake of notational simplicity we adopt the following convention: given
an ontology $O_i$, the corresponding set of concepts will be denoted
by $C_i$ and the partial order representing the concept hierarchy
by $\leq_{C_i}$.
It is important to mention that in the approach presented
here, terms are directly identified with concepts, i.e. we
neglect the fact that terms can be polysemous.
Now, the Lexical Recall ($LR$) of two ontologies $O_1$ and $O_2$ is measured as follows:

$$LR(O_1, O_2) := \frac{|C_1 \cap C_2|}{|C_2|}$$
Take for example the concept hierarchies $O_1$ and $O_2$ depicted in
Figure 4. In this example, the Lexical Recall is the number of concepts
shared by both hierarchies divided by the number of concepts in the
reference hierarchy $O_2$.
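As a minimal sketch, the Lexical Recall can be computed in a few lines of Python; the concept sets below are hypothetical and not those of Figure 4:

```python
# A minimal sketch of the Lexical Recall LR(O1, O2) = |C1 ∩ C2| / |C2|.
# The concept sets are hypothetical, not those of Figure 4.

def lexical_recall(learned_concepts, reference_concepts):
    """Fraction of reference concepts that also occur in the learned ontology."""
    learned = set(learned_concepts)
    reference = set(reference_concepts)
    return len(learned & reference) / len(reference)

c1 = {"bike", "car", "driveable", "rideable", "rentable"}  # learned (C1)
c2 = {"bike", "car", "vehicle", "two-wheeled vehicle"}     # reference (C2)
print(lexical_recall(c1, c2))  # 2 shared concepts out of 4 -> 0.5
```

Note that the measure is asymmetric: it is a recall with respect to the reference ontology $O_2$ only.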
Figure 4:
Example for an automatically acquired concept hierarchy (left)
compared to the reference concept hierarchy (right)
Figure 5:
Example for a perfectly learned concept hierarchy (left)
compared to the reference concept hierarchy (right)
In order to compare the taxonomies of two ontologies, we use the
Semantic Cotopy ($SC$) presented by [40].
The Semantic Cotopy of a concept $c$ is defined as the set of all its super- and
subconcepts:

$$SC(c, O) := \{c' \in C \mid c' \leq_C c \lor c \leq_C c'\}$$
In what follows we illustrate these and other definitions on the basis
of several example concept hierarchies. Take for instance the concept
hierarchies in Figure 5. We assume that the left concept
hierarchy has been automatically learned with our FCA
approach and that the concept hierarchy on the right is a handcrafted
one. Further, it is important to point out that the left ontology is, in terms
of the arrangement of the leaf nodes and abstracting from the labels
of the inner nodes, a perfectly learned concept hierarchy. This should
thus be reflected by a maximum similarity between both ontologies.
The Semantic Cotopy of the concept vehicle in the right ontology
in Figure 5 is for example {car, bike, two-wheeled vehicle, vehicle, object-to-rent} and the Semantic Cotopy of driveable in the left ontology is {bike, car, rideable, driveable, rentable, bookable}.
It thus already becomes clear that comparing the cotopies of both
concepts will not yield the desired result, i.e. a maximum similarity
between both concepts. We therefore use a modified version $SC'$ of the Semantic Cotopy
in which we only consider the concepts common to both concept hierarchies
(compare [14,15]), i.e.

$$SC'(c, O_1, O_2) := \{c' \in C_1 \cap C_2 \mid c' \leq_{C_1} c \lor c \leq_{C_1} c'\}$$
By using this Common Semantic Cotopy we thus exclude from the comparison concepts such as runable, offerable, needable, activity, vehicle etc., which appear in only one of the two ontologies.
So, the Common Semantic Cotopy of the concepts vehicle and driveable
is identical in both ontologies in Figure 5, i.e. {bike, car}
thus representing a perfect overlap between both concepts, which
certainly corresponds to our intuitions about the similarity of
both concepts. However, let's now consider the concept hierarchy in
Figure 6. The common cotopy of the concept bike is {bike} in both concept
hierarchies. In fact, every leaf concept in the left concept hierarchy
has a maximum overlap with the corresponding concept in the right ontology.
This is certainly undesirable and in fact leads to very high baselines
when comparing such trivial concept hierarchies with a reference
standard (compare our earlier results [14,15]).
Thus, we introduce a further modification $SC''$ of the Semantic Cotopy by
excluding the concept itself from its Common Semantic Cotopy, i.e.:

$$SC''(c, O_1, O_2) := \{c' \in C_1 \cap C_2 \mid c' <_{C_1} c \lor c <_{C_1} c'\}$$
This maintains the perfect overlap between vehicle and driveable
in the concept hierarchies in Figure 5, while yielding empty common
cotopies for all the leaf concepts in the left ontology of Figure 6.
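A sketch of the Common Semantic Cotopy $SC''$ in the same child-to-parent encoding; the left hierarchy below is an assumed approximation of Figure 5, built only from the concept names mentioned above:

```python
# Sketch of SC''(c, O1, O2): super-/subconcepts of c restricted to concepts
# occurring in BOTH hierarchies, with c itself excluded.

def ancestors(c, parent):
    out = set()
    while c in parent:
        c = parent[c]
        out.add(c)
    return out

def concepts(parent):
    return set(parent) | set(parent.values())

def common_cotopy(c, p1, p2):
    """SC''(c, O1, O2) for a concept c of the first hierarchy."""
    shared = concepts(p1) & concepts(p2)
    subs = {x for x in concepts(p1) if c in ancestors(x, p1)}
    return ((ancestors(c, p1) | subs) & shared) - {c}

# Assumed edges for the learned (left) and reference (right) hierarchies.
left = {"rentable": "bookable", "driveable": "rentable",
        "rideable": "driveable", "car": "driveable", "bike": "rideable"}
right = {"vehicle": "object-to-rent", "two-wheeled vehicle": "vehicle",
         "car": "vehicle", "bike": "two-wheeled vehicle"}

print(sorted(common_cotopy("driveable", left, right)))  # -> ['bike', 'car']
print(sorted(common_cotopy("vehicle", right, left)))    # -> ['bike', 'car']
```

Both concepts end up with the identical common cotopy {bike, car}, mirroring the perfect overlap discussed above.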
Figure 6:
Example for a trivial concept hierarchy (left)
compared to the reference concept hierarchy (right)
Now, according to Maedche et al. the Taxonomic Overlap ($TO$) of two ontologies $O_1$ and $O_2$ is computed as follows:

$$TO(O_1, O_2) := \frac{1}{|C_1|} \sum_{c \in C_1} TO(c, O_1, O_2)$$

where

$$TO(c, O_1, O_2) := \begin{cases} TO'(c, O_1, O_2) & \text{if } c \in C_2\\ TO''(c, O_1, O_2) & \text{if } c \notin C_2 \end{cases}$$

and $TO'$ and $TO''$ are defined as follows:

$$TO'(c, O_1, O_2) := \frac{|SC'(c, O_1, O_2) \cap SC'(c, O_2, O_1)|}{|SC'(c, O_1, O_2) \cup SC'(c, O_2, O_1)|}$$

$$TO''(c, O_1, O_2) := \max_{c' \in C_2} \frac{|SC'(c, O_1, O_2) \cap SC'(c', O_2, O_1)|}{|SC'(c, O_1, O_2) \cup SC'(c', O_2, O_1)|}$$
So, $TO'$ gives the similarity between concepts which are in both ontologies
by comparing their respective semantic cotopies. In contrast, $TO''$ gives
the similarity between a concept $c \in C_1 \setminus C_2$ and the
concept $c' \in C_2$ which maximizes the overlap of the respective semantic
cotopies, i.e. it makes an optimistic estimation, assuming an overlap that just does not happen to show up at the immediate lexical surface (compare [40]). The Taxonomic Overlap
between the two ontologies is then calculated by averaging over the taxonomic
overlaps of all the concepts in $C_1$.
In our case it does not make sense to calculate the Semantic Cotopy for
concepts which are in both ontologies, as these represent leaf nodes and
their common semantic cotopies are thus empty. Therefore, we calculate
the Taxonomic Overlap between two ontologies as follows:

$$TO''(O_1, O_2) := \frac{1}{|C_1 \setminus C_2|} \sum_{c \in C_1 \setminus C_2} \max_{c' \in C_2} \frac{|SC''(c, O_1, O_2) \cap SC''(c', O_2, O_1)|}{|SC''(c, O_1, O_2) \cup SC''(c', O_2, O_1)|}$$
Finally, as we do not want to compute the Taxonomic Overlap in one
direction only, we introduce the precision, the recall and an F-Measure calculating
the harmonic mean of both:

$$P(O_1, O_2) := TO''(O_1, O_2)$$
$$R(O_1, O_2) := TO''(O_2, O_1)$$
$$F(O_1, O_2) := \frac{2 \cdot P(O_1, O_2) \cdot R(O_1, O_2)}{P(O_1, O_2) + R(O_1, O_2)}$$

The importance of balancing recall and
precision against each other will become clear in the discussion of a
few examples below.
Let's consider for example the concept hierarchy $O_1$ in Figure
5. For the five concepts bookable, joinable, rentable,
driveable and rideable we find a corresponding concept in $O_2$
with a maximum Taxonomic Overlap, and the other way round
for the concepts activity, object-to-rent, vehicle and two-wheeled-vehicle in $O_2$, such that
$P(O_1, O_2) = R(O_1, O_2) = F(O_1, O_2) = 100\%$.
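To make the interplay of $P$, $R$ and $F$ concrete, the following Python sketch implements the $SC''$-based Taxonomic Overlap on two small hypothetical hierarchies; the concept names echo the figures, but the edges and the child-to-parent encoding are assumptions for illustration:

```python
# Sketch of taxonomic precision/recall/F-measure based on the common
# semantic cotopy SC'' (shared concepts only, the concept itself excluded).
# Hierarchies are encoded as child -> parent mappings (an assumption).

def ancestors(c, parent):
    out = set()
    while c in parent:
        c = parent[c]
        out.add(c)
    return out

def concepts(parent):
    return set(parent) | set(parent.values())

def common_cotopy(c, p1, p2):
    shared = concepts(p1) & concepts(p2)
    subs = {x for x in concepts(p1) if c in ancestors(x, p1)}
    return ((ancestors(c, p1) | subs) & shared) - {c}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def taxonomic_overlap(p1, p2):
    """TO''(O1, O2): average best cotopy overlap for concepts only in O1."""
    only1 = concepts(p1) - concepts(p2)
    return sum(
        max(jaccard(common_cotopy(c, p1, p2), common_cotopy(c2, p2, p1))
            for c2 in concepts(p2))
        for c in only1
    ) / len(only1)

def f_measure(p1, p2):
    p, r = taxonomic_overlap(p1, p2), taxonomic_overlap(p2, p1)
    return p, r, 2 * p * r / (p + r)

learned = {"driveable": "rentable", "rideable": "driveable",
           "car": "driveable", "bike": "rideable"}
reference = {"vehicle": "object-to-rent",
             "two-wheeled vehicle": "vehicle",
             "car": "vehicle", "bike": "two-wheeled vehicle"}

print(f_measure(learned, reference))  # structurally perfect -> (1.0, 1.0, 1.0)

# Removing 'rideable' (analogous to the lower-recall example) keeps P
# at 1.0 but lowers R, and with it the F-Measure.
learned_pruned = {"driveable": "rentable",
                  "car": "driveable", "bike": "driveable"}
print(f_measure(learned_pruned, reference))  # P = 1.0, R = 5/6, F = 10/11
```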
Figure 7:
Example for a concept hierarchy with lower recall
compared to the reference concept hierarchy
Figure 8:
Example for a concept hierarchy with lower precision
compared to the reference concept hierarchy
In the concept hierarchy
shown in Figure 7 the precision is still 100% for the same reasons as above, but because the
rideable concept has been removed, there is no
corresponding concept for two-wheeled-vehicle. The concept maximizing
the taxonomic similarity with two-wheeled-vehicle in $O_1$ is
driveable, with a Taxonomic Overlap
of 0.5. The recall is thus
$R(O_1, O_2) = \frac{3 \cdot 1 + 0.5}{4} = 87.5\%$
and the F-Measure decreases to
$F(O_1, O_2) = \frac{2 \cdot 1 \cdot 0.875}{1 + 0.875} \approx 93.33\%$.
In the concept hierarchy of $O_1$ in
Figure 8, an additional concept planable has been
introduced, which reduces the precision, while the recall obviously stays the same,
and thus the F-Measure also decreases. It thus becomes clear why it is important to measure
both the precision and the recall of the automatically learned concept hierarchies
and to balance them against each other by the harmonic mean or F-Measure.
For the automatically learned concept hierarchy in Figure
4, the precision, the recall and the resulting F-Measure are computed in
exactly the same way.
As a comparison, for the trivial concept hierarchy in Figure
6 we get a maximum precision (per definition), but a lower recall and
consequently a lower F-Measure.
It is important to mention
that, though in our toy examples the difference between the automatically
learned concept hierarchy and the trivial concept hierarchy with respect
to these measures is not very big, when
considering real-world concept hierarchies with a much larger number
of concepts the F-Measures
for trivial concept hierarchies will be very low (see the results
in Section 6).
Finally, we also calculate the harmonic mean of the Lexical Recall
and the F-Measure as follows:

$$F'(O_1, O_2) := \frac{2 \cdot LR(O_1, O_2) \cdot F(O_1, O_2)}{LR(O_1, O_2) + F(O_1, O_2)}$$

For the automatically learned concept hierarchy $O_1$ in Figure 4, for
example, $F'$ is obtained by inserting the values of $LR$ and $F$
computed above.
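A minimal helper for this final combination (the input values are purely illustrative, not those of Figure 4):

```python
# Harmonic mean of Lexical Recall and taxonomic F-Measure, as defined above.
def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b else 0.0

lr, f = 0.5, 0.8  # hypothetical values for illustration
print(round(harmonic_mean(lr, f), 4))  # -> 0.6154
```

As with any harmonic mean, the combined score is dominated by the weaker of the two components, so an ontology cannot compensate for very poor lexical coverage with good taxonomic structure or vice versa.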