Next: Overall Process
Up: Learning Concept Hierarchies from
Previous: Learning Concept Hierarchies from
Taxonomies or concept hierarchies are crucial for any knowledge-based system,
i.e. a system equipped with declarative knowledge about the domain it deals
with and capable of reasoning on the basis of this knowledge.
Concept hierarchies are important because they allow information to be
structured into categories, thus fostering its search and
reuse. Further, they allow rules as well as relations to be formulated
in an abstract and concise way, facilitating the development, refinement and
reuse of a knowledge base.
Moreover, the fact that they allow generalization over words has been shown to
provide benefits in a number of applications such as Information Retrieval
[62] as well as text clustering [35] and
classification [6]. In addition, they have
important applications within Natural Language Processing (e.g. [12]).
However, it is also well known that any knowledge-based system suffers from
the so-called knowledge acquisition bottleneck, i.e. the difficulty
of actually modeling the domain in question. In order to partially overcome
this problem, we present a novel approach to automatically learning a concept
hierarchy from a text corpus.
Making the knowledge implicitly contained in texts explicit
is a great challenge. For example, [7]
have argued that writing and
reading text is in fact a process of background knowledge maintenance, in
the sense that basic domain knowledge is assumed and only the
relevant part of knowledge that is the subject of the text or article is
mentioned in a more or less explicit way. Knowledge can thus
be found in texts at different levels of explicitness, depending on
the sort of text considered. Handbooks, textbooks or dictionaries,
for example, contain explicit knowledge in the form of definitions such
as ``a tiger is a mammal'' or ``mammals such as tigers, lions or
elephants''. In fact, some researchers have exploited such regular
patterns to discover taxonomic or
part-of relations in texts [33,10,36,2].
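To illustrate how such regular patterns can be exploited, the following is a minimal sketch in Python, not the method of any of the cited works. It assumes a single hand-written pattern for ``X such as A, B or C''; the names SUCH_AS and extract_isa_pairs are hypothetical, and no normalization of plurals or multi-word terms is attempted.

```python
import re

# Hypothetical pattern (an assumption for illustration):
# "<hypernym> such as <hyponym>, <hyponym> or <hyponym>"
SUCH_AS = re.compile(r"(\w+) such as ((?:\w+(?:, | or | and ))*\w+)")


def extract_isa_pairs(sentence):
    """Return (hyponym, hypernym) pairs matched by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        # Split the enumeration on the same separators the pattern accepts.
        hyponyms = re.split(r", | or | and ", match.group(2))
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs


pairs = extract_isa_pairs("mammals such as tigers, lions or elephants")
for hyponym, hypernym in pairs:
    print(hyponym, "is-a", hypernym)
```

A pattern of this kind only fires when the relation is stated explicitly, which is exactly the limitation that motivates the distributional alternative.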
However, it seems that the more technical and specialized the
texts get, the less basic knowledge we find stated explicitly. Thus,
an interesting alternative is to derive
knowledge from texts by analyzing how certain terms are used
rather than looking for their explicit definitions. Along these lines,
the distributional hypothesis [32] assumes that terms
are similar to the extent to which they share similar linguistic contexts.
In fact, different methods have been proposed in the literature to
address the problem of (semi-) automatically deriving a concept hierarchy
from text based on the distributional hypothesis. Basically, these methods
can be grouped into two classes:
the similarity-based methods on the one hand and the
set-theoretical methods on the other. Both classes of methods adopt a
vector-space model and represent a word or term as a vector containing
features or attributes derived from a certain corpus. There is certainly
great variation in which attributes are used for this purpose, but typically
syntactic features of some sort are used, such as conjunctions, appositions
[8] or verb-argument dependencies [34,46,30,24].
The first type of method is characterized by the use
of a similarity or distance measure to compute the
pairwise similarity or distance between the vectors corresponding to two words or
terms, in order to decide whether they can be clustered. Some prominent
examples of this type of method have been developed by [34,46,30,24,8] as well as [5].
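As a minimal sketch of the similarity-based view, the following Python fragment computes the cosine between sparse feature vectors. The cosine is only one possible choice of measure and is an assumption here, as are the toy counts and the names cosine and vectors; none of them is taken from the cited works.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as feature->weight dicts."""
    dot = sum(weight * v.get(feature, 0.0) for feature, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


# Invented verb co-occurrence counts, for illustration only.
vectors = {
    "hotel": {"book": 3, "rent": 1},
    "apartment": {"book": 2, "rent": 2},
    "bike": {"rent": 2, "ride": 3},
}

print(cosine(vectors["hotel"], vectors["apartment"]))
print(cosine(vectors["hotel"], vectors["bike"]))
```

With these invented counts, hotel is closer to apartment than to bike, so a clustering algorithm driven by this measure would tend to group the first two before the third.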
Set-theoretical approaches partially order the objects according to
the inclusion relations between their attribute sets [47,56].
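The ordering by attribute-set inclusion can be sketched as follows. This is an illustration under stated assumptions: the objects and attributes are invented, the function name more_specific_pairs is hypothetical, and an object is taken to be more specific than another if its attribute set properly contains the other's.

```python
# Invented attribute sets, for illustration only.
attributes = {
    "hotel": {"bookable"},
    "apartment": {"bookable", "rentable"},
    "car": {"bookable", "rentable", "driveable"},
    "bike": {"rentable", "rideable"},
}


def more_specific_pairs(attributes):
    """Return sorted (specific, general) pairs where general's attributes
    are a proper subset of specific's attributes."""
    pairs = []
    for specific, specific_attrs in attributes.items():
        for general, general_attrs in attributes.items():
            if general_attrs < specific_attrs:  # proper subset test
                pairs.append((specific, general))
    return sorted(pairs)


for specific, general in more_specific_pairs(attributes):
    print(specific, "is below", general)
```

Note that bike is incomparable with the other objects in this toy data, which is why the result is a partial rather than a total order.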
In this paper, we present an approach based on
Formal Concept Analysis, a method based on order theory and mainly used for
the analysis of data, in particular for discovering inherent relationships
between objects described through a set of attributes on the one hand, and the
attributes themselves on the other
[26]. In order to derive attributes from a certain corpus,
we parse it and extract verb/prepositional phrase (PP)-complement, verb/object and verb/subject
dependencies. For each noun appearing as head of these argument positions,
we then use the corresponding verbs as attributes to build the formal
context, on the basis of which we compute the formal concept lattice.
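The step from dependencies to formal concepts can be sketched in Python as follows. This is a naive illustration, not the implementation used in this paper: the (noun, verb) pairs stand in for hypothetical parser output, the names build_context and formal_concepts are assumptions, and the concepts are enumerated by closing the object intents under intersection, which is only practical for small contexts.

```python
def build_context(pairs):
    """Map each noun (object) to the set of verbs (attributes) it occurs with."""
    context = {}
    for noun, verb in pairs:
        context.setdefault(noun, set()).add(verb)
    return context


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs of a small formal context."""
    all_attributes = set().union(*context.values())
    # Every intent is an intersection of object intents; the full attribute
    # set corresponds to the empty intersection.
    intents = {frozenset(all_attributes)}
    for attrs in context.values():
        intents |= {frozenset(attrs) & intent for intent in intents}
    concepts = []
    for intent in intents:
        # The extent collects every object that has all attributes of the intent.
        extent = frozenset(o for o, attrs in context.items() if intent <= attrs)
        concepts.append((extent, intent))
    return concepts


# Hypothetical (noun, verb) dependencies, for illustration only.
pairs = [
    ("hotel", "book"),
    ("apartment", "book"), ("apartment", "rent"),
    ("car", "book"), ("car", "rent"), ("car", "drive"),
    ("bike", "rent"), ("bike", "ride"),
]

concepts = formal_concepts(build_context(pairs))
for extent, intent in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
```

Ordering the resulting concepts by inclusion of their extents yields the lattice; the sketch stops at the enumeration step.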
Though different methods have been explored in the literature, there
is actually a lack of comparative work concerning the task of automatically
learning concept hierarchies with clustering techniques. However, as argued
by [15],
ontology engineers need guidelines about the effectiveness, efficiency
and trade-offs of different methods in order to decide which techniques
to apply in which settings. Thus, we present a comparison along these
lines between our FCA-based approach, hierarchical bottom-up (agglomerative)
clustering and Bi-Section-KMeans as an instance of a divisive algorithm.
In particular, we compare the learned concept hierarchies in terms of similarity
with handcrafted reference taxonomies for two domains: tourism and
finance. In addition, we examine the impact of using different
information measures to weight the significance of a given
object/attribute pair. Furthermore, we also investigate the use
of a smoothing technique to cope with data sparseness.
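To make the idea of weighting object/attribute pairs concrete, the following sketch scores each pair and keeps only those above a threshold. The measure shown, pointwise mutual information, is an assumption chosen for illustration and is not necessarily one of the measures examined here; the counts, the threshold and the names pmi_weights and select_pairs are likewise hypothetical.

```python
import math
from collections import Counter


def pmi_weights(pair_counts):
    """Pointwise mutual information for each (noun, verb) pair with a nonzero count."""
    total = sum(pair_counts.values())
    noun_counts = Counter()
    verb_counts = Counter()
    for (noun, verb), count in pair_counts.items():
        noun_counts[noun] += count
        verb_counts[verb] += count
    weights = {}
    for (noun, verb), count in pair_counts.items():
        p_joint = count / total
        p_noun = noun_counts[noun] / total
        p_verb = verb_counts[verb] / total
        weights[(noun, verb)] = math.log2(p_joint / (p_noun * p_verb))
    return weights


def select_pairs(pair_counts, threshold):
    """Keep only the pairs whose weight reaches the threshold."""
    weights = pmi_weights(pair_counts)
    return {pair for pair, weight in weights.items() if weight >= threshold}


# Invented counts, for illustration only.
pair_counts = {
    ("hotel", "book"): 8, ("hotel", "rent"): 1,
    ("bike", "rent"): 6, ("bike", "book"): 1,
}

print(select_pairs(pair_counts, threshold=0.0))
```

Only the selected pairs would then enter the formal context or the feature vectors, so that rare and possibly spurious co-occurrences do not shape the hierarchy.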
The remainder of this paper is organized as follows: Section 2
describes the overall process and Section 3 briefly
introduces Formal Concept Analysis and describes the nature of the concept
hierarchies we automatically acquire. Section 4 describes the
text processing methods we apply to automatically derive context attributes.
In Section 5 we discuss our evaluation
methodology in detail, and in Section 6 we present the actual results.
In particular, we present the comparison of the different approaches
and evaluate the impact of the different information
measures and of our smoothing technique.
Before concluding, we discuss some related work in Section 7.
Philipp Cimiano
2005-08-04