
Introduction

Taxonomies or concept hierarchies are crucial for any knowledge-based system, i.e. a system equipped with declarative knowledge about the domain it deals with and capable of reasoning on the basis of this knowledge. Concept hierarchies are important because they make it possible to structure information into categories, thus fostering its search and reuse. They also make it possible to formulate rules and relations in an abstract and concise way, facilitating the development, refinement and reuse of a knowledge base. Moreover, the fact that they allow generalization over words has been shown to provide benefits in a number of applications such as Information Retrieval [62], text clustering [35] and text classification [6]. They also have important applications within Natural Language Processing (e.g. [12]).
However, it is also well known that any knowledge-based system suffers from the so-called knowledge acquisition bottleneck, i.e. the difficulty of actually modeling the domain in question. In order to partially overcome this problem, we present a novel approach to automatically learning a concept hierarchy from a text corpus.
Making the knowledge implicitly contained in texts explicit is a great challenge. For example, [7] have argued that writing and reading text is in fact a process of background knowledge maintenance, in the sense that basic domain knowledge is assumed and only the relevant part of the knowledge, the part that is the subject of the text or article, is mentioned in a more or less explicit way. Knowledge can thus be found in texts at different levels of explicitness, depending on the sort of text considered. Handbooks, textbooks and dictionaries, for example, contain explicit knowledge in the form of definitions such as "a tiger is a mammal" or "mammals such as tigers, lions or elephants". Some researchers have exploited such regular patterns to discover taxonomic or part-of relations in texts [33,10,36,2].
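As a minimal sketch of pattern-based extraction of this kind, the following Python fragment matches the "such as" construction quoted above with a regular expression. The pattern, the function name `extract_isa` and the restriction to single-word terms are assumptions made for illustration; the cited approaches typically operate on tagged or parsed text rather than on raw strings.

```python
import re

# Hypothetical pattern for "<hypernym> such as <hyponym>, <hyponym> or <hyponym>".
# Assumption: every term is a single word; real systems match noun phrases.
SUCH_AS = re.compile(r"(\w+) such as ((?:\w+, )*\w+(?: (?:or|and) \w+)?)")


def extract_isa(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        # Split the enumeration on commas and on the final conjunction.
        hyponyms = re.split(r", | or | and ", match.group(2))
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs


pairs = extract_isa("mammals such as tigers, lions or elephants")
```

On the example sentence, `pairs` holds one (hyponym, hypernym) tuple for each of the three animals, all with "mammals" as the hypernym. A sentence without the "such as" construction yields an empty list, which illustrates why such patterns find little in texts that do not state basic knowledge explicitly.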
However, it seems that the more technical and specialized the texts get, the less basic knowledge we find stated explicitly. Thus, an interesting alternative is to derive knowledge from texts by analyzing how certain terms are used rather than looking for their explicit definitions. Along these lines, the distributional hypothesis [32] assumes that terms are similar to the extent to which they share similar linguistic contexts.
Different methods have been proposed in the literature to address the problem of (semi-)automatically deriving a concept hierarchy from text based on the distributional hypothesis. These methods can be grouped into two classes: similarity-based methods on the one hand and set-theoretical methods on the other. Both adopt a vector-space model and represent a word or term as a vector containing features or attributes derived from a certain corpus. The attributes used for this purpose vary considerably, but typically they are syntactic features of some sort, such as conjunctions, appositions [8] or verb-argument dependencies [34,46,30,24]. Similarity-based methods are characterized by the use of a similarity or distance measure to compute the pairwise similarity or distance between the vectors corresponding to two words or terms, in order to decide whether they can be clustered or not. Some prominent examples of this type of method have been developed by [34,46,30,24,8] as well as [5]. Set-theoretical approaches partially order the objects according to the inclusion relations between their attribute sets [47,56].
In this paper, we present an approach based on Formal Concept Analysis (FCA), a method based on order theory and mainly used for the analysis of data, in particular for discovering inherent relationships between objects described through a set of attributes on the one hand, and the attributes themselves on the other [26].
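The difference between the two classes of methods can be illustrated on a toy example. The sketch below is not taken from any of the cited systems: the terms, the verb attributes and the counts are invented, and cosine similarity is only one of many possible similarity measures. It computes pairwise similarities between term vectors (the similarity-based view) and, separately, the inclusion relations between the terms' attribute sets (the set-theoretical view).

```python
import math
from itertools import combinations

# Hypothetical co-occurrence counts: term -> {verb attribute: count}.
vectors = {
    "hotel": {"book": 3},
    "apartment": {"book": 2, "rent": 4},
    "car": {"book": 1, "rent": 3, "drive": 5},
}


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v.get(attr, 0) for attr, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)


def attributes(vec):
    """The set of attributes with a non-zero count."""
    return {attr for attr, count in vec.items() if count > 0}


# Similarity-based view: score every pair of terms and pick the closest pair,
# as a clustering algorithm would do when deciding what to merge first.
similarities = {
    (a, b): cosine(vectors[a], vectors[b])
    for a, b in combinations(vectors, 2)
}
most_similar = max(similarities, key=similarities.get)

# Set-theoretical view: (a, b) means every attribute of a is also one of b,
# which yields a partial order over the terms instead of a similarity score.
inclusions = [
    (a, b)
    for a in vectors
    for b in vectors
    if a != b and attributes(vectors[a]) <= attributes(vectors[b])
]
```

With these invented counts, the similarity-based view singles out "apartment" and "car" as the closest pair, whereas the set-theoretical view orders all three terms in a chain, because the attributes of "hotel" are included in those of "apartment", which are in turn included in those of "car".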
In order to derive attributes from a certain corpus, we parse it and extract verb/prepositional phrase (PP)-complement, verb/object and verb/subject dependencies. For each noun appearing as the head of one of these argument positions, we then use the corresponding verbs as attributes for building the formal context, and calculate the formal concept lattice on its basis.
Though different methods have been explored in the literature, there is a lack of comparative work concerning the task of automatically learning concept hierarchies with clustering techniques. However, as argued by [15], ontology engineers need guidelines about the effectiveness, efficiency and trade-offs of different methods in order to decide which techniques to apply in which settings. Thus, we present a comparison along these lines between our FCA-based approach, hierarchical bottom-up (agglomerative) clustering and Bi-Section-KMeans as an instance of a divisive algorithm. In particular, we compare the learned concept hierarchies in terms of their similarity to handcrafted reference taxonomies for two domains: tourism and finance. In addition, we examine the impact of using different information measures to weight the significance of a given object/attribute pair. Furthermore, we investigate the use of a smoothing technique to cope with data sparseness.
The remainder of this paper is organized as follows: Section 2 describes the overall process, and Section 3 briefly introduces Formal Concept Analysis and describes the nature of the concept hierarchies we automatically acquire. Section 4 describes the text processing methods we apply to automatically derive context attributes. In Section 5 we discuss our evaluation methodology in detail, and in Section 6 we present the actual results, in particular the comparison of the different approaches and the evaluation of the impact of different information measures and of our smoothing technique. Before concluding, we discuss some related work in Section 7.
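To give a first impression of the step from verb/noun dependencies to a formal context and its concepts, as outlined in this introduction, the following sketch builds a context from a handful of invented (verb, noun) pairs and enumerates its formal concepts. The input pairs and all names are hypothetical, no weighting or smoothing is applied, and the enumeration is a naive exponential one chosen for brevity; it should not be read as the algorithm actually used to compute the lattice.

```python
from itertools import combinations

# Hypothetical (verb, noun) pairs standing in for parser output.
dependencies = [
    ("book", "hotel"), ("book", "apartment"), ("book", "car"),
    ("rent", "apartment"), ("rent", "car"),
    ("drive", "car"),
]

# Formal context: each noun (object) is described by the set of verbs
# (attributes) it occurs with.
context = {}
for verb, noun in dependencies:
    context.setdefault(noun, set()).add(verb)

all_attributes = set().union(*context.values())


def extent(attrs):
    """Objects that have every attribute in attrs."""
    return frozenset(obj for obj, own in context.items() if attrs <= own)


def intent(objs):
    """Attributes shared by every object in objs (all attributes if objs is empty)."""
    if not objs:
        return frozenset(all_attributes)
    return frozenset(set.intersection(*(context[obj] for obj in objs)))


# Naive enumeration: a formal concept is a pair (extent, intent) in which each
# component determines the other, so closing every subset of objects finds all.
concepts = set()
objects = list(context)
for size in range(len(objects) + 1):
    for subset in combinations(objects, size):
        shared = intent(subset)
        concepts.add((extent(shared), shared))
```

For this toy context the enumeration yields three concepts, which form a chain: everything that can be booked, the subset that can also be rented, and the single object that can additionally be driven. Ordering the concepts by inclusion of their extents gives the hierarchy.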


Philipp Cimiano 2005-08-04