Chapter 1
Multilingual Resources
Editor: Martha Palmer
Contributors:
Nicoletta Calzolari
Khalid Choukri
Christiane Fellbaum
Eduard Hovy
Nancy Ide
Abstract
A searing lesson learned in the last five years is the enormous amount of knowledge required to enable broad-scale language processing. Whether that knowledge is acquired by traditional, manual means or by semi-automated, statistically oriented methods, the need for international standards, evaluation/validation procedures, and ongoing maintenance and updating of resources that can be made available through central distribution centers is now greater than ever. We can no longer afford to (re)develop new grammars, lexicons, and ontologies for each new application, or to collect new corpora for each application when corpus preparation is a nontrivial task. This chapter describes the current state of affairs for each type of resource (corpora, grammars, lexicons, and ontologies) and outlines what is required in the near future.
1.1 Introduction
Over the last decade, researchers and developers of Natural Language Processing technology have created basic tools that are impacting daily life. Speech recognition saves the telephone company millions of dollars. Text-to-speech synthesis aids the blind. Massive resources for training and analysis are available in the form of annotated and analyzed corpora for spoken and written language. The explosion in applications has largely been due to new algorithms that harness statistical techniques to achieve maximal leverage of linguistic insights, as well as to the huge increase in computing power per dollar.
Yet the ultimate goals of the various branches of Natural Language Processing, namely accurate Information Extraction and Text Summarization (Chapter 3), focused multilingual Information Retrieval (Chapter 2), fluent Machine Translation (Chapter 4), and robust Speech Recognition (Chapter 5), still remain tantalizingly out of reach. The principal difficulty lies in dealing with meaning. However well systems perform their basic steps, they are still not able to perform at high enough levels for real-world domains, because they are unable to understand sufficiently what the user is trying to say or do. The difficulty of building adequate semantic representations, both in design and scale, has limited the fields of Natural Language Processing in two ways: either to applications that can be circumscribed within well-defined subdomains, as in Information Extraction and Text Summarization (Chapter 3); or to applications that operate at a less-than-ideal level of performance, as in Speech Recognition (Chapter 5) or Information Retrieval (Chapter 2).

The two major causes of these limitations are related. First, large-scale, all-encompassing resources (lexicons, grammars, etc.) upon which systems can be built are rare or nonexistent. Second, theories that enable the adequately accurate representation of semantics (meaning) for a wide variety of specific aspects (time, space, causality, interpersonal effects, emotions, etc.) do not exist, or are so formalized as to be too constraining for practical implementation. At this time, we have no way of constructing a wide-coverage lexicon with adequately formalized semantic knowledge, for example.
On the other hand, we do have many individual resources, built up over almost five decades of projects in Language Processing, ranging from individual lexicons or grammars of a few thousand items to the results of large multi-project collaborations such as ACQUILEX. We also have access to the work on semantics in Philosophy, NLP, Artificial Intelligence (AI), and Cognitive Science, and in particular to the efforts of large AI projects such as CYC on the construction of semantic knowledge bases (see Section 1.3.4 below). Thus one of our major challenges consists of collecting and reusing what exists, rather than starting yet again.
The value of standards has long been recognized as a way to ensure that resources are not abandoned when their projects end, but can instead be built upon by subsequent projects. Both in Europe and the US, various more or less coordinated standards efforts have existed for various resources. In the US, these issues with respect to lexicons have been taken up in a series of recent workshops under the auspices of the ACL Special Interest Group on the Lexicon (SIGLEX). Word sense disambiguation (WSD) was a central topic of discussion at the workshop on Semantic Tagging at the ANLP 1997 conference in Washington, chaired by Marc Light (Kilgarriff, 1997), which featured several working groups on polysemy and computational lexicons. This meeting led to the organization of a follow-on series, SIGLEX98-SENSEVAL, and subsequent workshops (SIGLEX99), which address WSD even more directly by including evaluations of word sense disambiguation systems and in-depth discussions of the suitability of traditional dictionary entries as entries in computational lexicons. In Europe, the EAGLES standardization initiative has begun an important movement towards common formats for lexicon standardization and towards coordinated efforts to standardize other resources. Such standardization is especially critical in Europe, where multilinguality adds another dimension of complexity to natural language processing issues. The EAGLES report can be found at http://www.ilc.pi.cnr.it/EAGLES96/rep2/rep2.html.

Recently, renewed interest in the semi-automated acquisition of resource information (words for lexicons, rules for grammars) has lent a new urgency to the clear and simple formulation of such standards. Though the problem is a long way from being finally solved, the issues are being more clearly defined. In particular there is a growing awareness, especially among younger researchers, that the object is not to prove the truth or correctness of any particular theoretical approach, but rather to agree on a common format that allows us to merge multiple sources of information. Otherwise we doom ourselves to expending vast amounts of effort in the pursuit of nothing more than duplication. The quickest way to arrive at a common format may be to make very coarse distinctions initially, and then refine the results later, an approach that was anathema some years ago. A common format is required not only so that we can share resources and communicate information between languages, but also to enable a common protocol for communicating information between different modalities.
Acknowledging two facts is the key to successful future multilingual information processing:
In principle, we consider all of the various major types of information used in Language Processing, which include morphology, parts of speech, syntax, collocations, frequency of occurrence, semantics, discourse, and interpersonal and situational communicative pragmatics. Since there is no way to determine a priori which aspect plays the primary role in a given instance, all of these levels of representation could be equally relevant to a task. In this chapter, however, we focus only on the resources currently most critical for continued progress:
Naturally, in parallel to the specification of the nature of elements of each of these entities, the development of semi-automated techniques for acquisition, involving statistical modeling, and efficient and novel algorithms, is crucial. These techniques are discussed in
Chapter 6.

1.2 Development of Language Resources: Past, Present and Future
In this section we discuss the role of language resources in a multilingual setting, focusing on the four essentials of Language Resources (distribution, development, evaluation, and maintenance) that pertain equally to all of them. At the beginning of the 1990s, the US was at the vanguard of the production of language resources, having transformed the conclusions of the first workshop on Evaluation of Natural Language Processing Systems into DARPA's MUC, TREC, and later MT and SUMMAC evaluations. Under DARPA and other (primarily military) agency funding, the Language Processing standardization and dissemination activities included:
Since the formation of the European Language Resources Association (ELRA) in 1995, however, the leadership role has passed to Europe, which is now well ahead of the US in the recognition of the need for standardization, lexical semantics, and multilinguality. Recognizing the strategic role of Language Resources, the CEC launched a large number of projects in Europe in the last decade, many of them in the recent Language Engineering program. In this vision, the language resource activities essential for a coordinated development of the field included the development, evaluation, and distribution of core Language Resources for all EU languages, conforming to agreed-upon standards and formats. The Language Engineering projects that coherently implemented (or started to work towards the implementation of) these types of activity include:
The ever-spreading tentacles of the Internet have revived US interest in multilingual information processing, with a corresponding renewed interest in relevant language resources. At this point the community will be well served by a coordinated international effort that merges what has been achieved in North America, especially in the areas of evaluation, with what has been achieved in Europe, especially with respect to development and maintenance.
Development
Efficient and effective development in an area as complex as Language Processing requires close cooperation among various research groups, as well as frequent integration of diverse components. This makes a shared platform of large-coverage language resources and basic components an absolute necessity as a common infrastructure, to ensure:
Though we address the particular needs of individual resources below, they all have an essential need for international collaborations that specifically collect existing resources and evaluation methods, integrate them into unified practical and not-too-complex frameworks, and deliver them to such bodies as LDC and ELRA. This work will not only facilitate Language Processing projects but will prove invaluable in pinpointing gaps and theoretical shortcomings in the coverage and applicability of the resources and evaluation methods.
Evaluation
The importance of evaluations to assess the current state of the art and measure progress of technologies, as discussed in
Chapter 8, is evident. There is a need for an independent player to construct and manage both the data and the evaluation campaigns. However, performing evaluations has proven to be a rather difficult enterprise, and not only for technical reasons. Evaluations with high inherent overheads are often perceived as an unrewarding and possibly disruptive activity. However, in every endeavor in which an appropriate and systematic program of evaluations has evolved, marked progress has been achieved in practical language engineering terms. This phenomenon is discussed further in Chapter 6.
Despite this fact, many key players (customers and developers) have historically shown little interest in performing substantial evaluations, since they simply cannot afford the sizeable investments required. Unfortunately, the consumer reports appearing in various computer magazines lack the necessary accuracy and methodological criteria to be considered objective, valid evaluations. A further limitation is the lack of access to laboratory prototypes, so that only systems that have already been fielded are available for testing by the customer community. Furthermore, developers prefer to spend their time on development instead of on assessment, particularly if the evaluation is to be public.
As a result, the only remaining players with the requisite financial resources, infrastructure, and social clout are the funding agencies. When they are potential users of the technology they can perform in-house evaluations; examples include the Service de Traduction (translation services) of the CEC and the US Department of Defense evaluations of information retrieval, text summarization, and information extraction (TREC, SUMMAC, and MUC; see
Chapters 2 and 3). They can also include evaluations as a necessary component of systems whose development they are funding, as a method of determining follow-on funding. In such a case, however, it is critical to ensure community consensus on the evaluation criteria lest the issues become clouded by the need for funds.

Developing evaluation measures for resources is even more complex than evaluating applications, such as summarization and machine translation. With applications, the task to be achieved can be specified and performance can be measured against the desired outcome. With resources such as lexicons, however, the evaluation has to determine, in some way, how well the resource supports the functioning of the application. This can only be done if the contribution of the lexicon can be teased apart from the contribution of the other components of the application system and the performance of the system as a whole. Evaluation of resources is therefore by necessity secondary or indirect, which makes it especially difficult to perform. An unfortunate result of this has been the proliferation of unused grammars, lexicons, resource acquisition tools, and word taxonomies that, with appropriate revision, could have provided valuable community resources. Constructive evaluations of resources are fundamental to their reusability.
However, there is an inherent danger in tying funding too directly to short-term evaluation schemes: it can have the unfortunate result of stifling innovation and slowing down progress. It is critical for evaluations to measure fundamental improvements in technology and not simply reward the system that has been geared (hacked) most successfully to a particular evaluation scheme. The SIGLEX workshops mentioned above provide an example of a grassroots movement to define more clearly the role of syntax, lexical semantics, and lexical co-occurrence in word sense disambiguation; as such, they examine not just system performance but the very nature of word sense distinctions. The next five years should see a major shift away from purely task-oriented evaluations and towards a hybrid evaluation approach that furthers our understanding of the task while at the same time focusing on measurable results.
Distribution
As with the LDC in the US, the role of ELRA in Europe as an intermediary between producers and users of language resources greatly simplifies the distribution process by eliminating a great deal of unnecessary contractual arrangement and easing sales across borders. MLCC, ELRA's multilingual corpus, for example, consists of data from 6 different newspapers in 6 different languages. ELRA has signed contracts with each provider, and the user who wishes to acquire the set of databases only has to sign a single contract with ELRA. Care is taken to ensure that the language resources are clear of intellectual property rights (IPR) restrictions and are available under commercial and research licenses, with a list of key applications associated with them. (The alternative is a bureaucratic nightmare, in which each user has to sign 6 different contracts and negotiate IPR for each one, with 6 different producers, in 6 different countries, under 6 different legal systems. Having a few major distribution sites is clearly the only sane way of making this data available.)
In addition to distributing corpora, both raw and annotated, the next few years should see the addition of grammars, lexicons and ontologies as resources that could be made available through such distribution sites.
Maintenance
Many of the resources mentioned above have just been created or are still in the process of being created. The issue of maintenance has therefore not really been addressed in either the US or Europe, although it did provide the topic for a panel discussion at the First International Conference on Language Resources and Evaluation (LREC-98). A question that has already arisen concerns EuroWordNet, which is linked to WordNet 1.5 (the current version when EuroWordNet was begun), although version 1.5 has since been replaced by WordNet 1.6. How can EuroWordNet best be updated to reflect the new version?
Anyone having even the briefest acquaintance with software product cycles will expect that the maintenance of language resources will shortly become a central issue.
1.3 Types of Language Resources
1.3.1 Corpora
Before corpora are suitable for natural language processing work, it is necessary for them to be created and prepared (or "annotated"). The term "annotation" is very broadly construed at present, involving everything from identifying paragraph breaks to the addition of information that is not in any way present in the original, such as part-of-speech tags. In general, one can divide what is now lumped together under the term "corpus annotation" into three broad categories:
In order to enable more efficient and effective creation of corpora for NLP work, it is essential to understand the nature of each of these phases and to establish mechanisms and means to accomplish each. Step (1) can be nearly fully automated, but steps (2) and (3) require more processing overhead as well as significant human intervention. In particular, we need to develop algorithms and methods for automating these two steps. This is especially true for step (2), which has received only marginal attention except in efforts such as the TREC name identification task, and this will require funding. Step (3) has received more attention, since algorithms for identifying complex linguistic elements have typically been viewed as a more legitimate area of research. However, as discussed above, appropriate markups for lexical semantic information are at a very rudimentary stage of development. One of the most important directions for corpus annotation is determining a richer level of annotation that includes word senses, predicate argument structure, noun-phrase semantic categories, and coreference.
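The sketch below, in Python, illustrates the kind of layered annotation under discussion. The sentence, tag values, and layer names are invented for illustration; real projects would follow an agreed markup standard rather than this ad hoc structure, but the division of labor (near-automatic token offsets versus human-checked semantic layers) is the point.

```python
# A minimal sketch of layered ("standoff") corpus annotation, assuming a toy
# sentence and invented tag values; real projects would follow an agreed markup
# standard rather than this ad-hoc structure.

sentence = "Mary bought a newspaper in Boston."

# Step 1: largely automatable markup -- token boundaries with character offsets.
tokens = []
position = 0
for surface in sentence.replace(".", " .").split():
    start = sentence.find(surface, position)
    tokens.append({"start": start, "end": start + len(surface), "text": surface})
    position = start + len(surface)

# Steps 2 and 3: annotation layers that still need human checking: part-of-speech
# tags, named entities, and (rudimentary, as noted above) word senses and
# predicate-argument structure.  Values here are purely illustrative.
annotations = {
    "pos":   [(0, "NNP"), (1, "VBD"), (2, "DT"), (3, "NN"), (4, "IN"), (5, "NNP"), (6, ".")],
    "ne":    [{"tokens": [5], "type": "LOCATION"}],
    "sense": [{"token": 3, "inventory": "toy-lexicon", "sense_id": "newspaper.1"}],
    "pred_arg": [{"predicate": 1, "args": {"ARG0": [0], "ARG1": [2, 3]}}],
}

for layer, entries in annotations.items():
    print(layer, entries)
```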
It is also critical that we devise means to include information about elements in a text in a way that makes the resulting texts maximally processable and reusable. In particular, it is important to ensure that the markup used to identify text elements is:
Therefore, in order to create corpora that are both maximally usable and reusable, it will be necessary to specify clearly the ways in which the corpora will be used and the capabilities of the tools that will process them. This in turn demands that effort be put into the development of annotation software, and above all, that this development be undertaken in full collaboration with the developers of the software that will process this data and the users who will access it. In other words, as outlined in (Ide, 1998), there are two major requirements for advancing the creation and use of corpora in NLP:
1.3.2 Grammars
The development of powerful and accurate grammars was seen as a primary necessity for Language Processing in the 1960s and early 1970s. However, it has gradually been recognized that building a complete grammar for any language is nearly impossible, and that a tremendous amount of essential information is lexically specific, such as modifier preferences and idiosyncratic expressive details. This has led to a shift in emphasis away from traditional rule-based grammars for broad-coverage applications. The systems developed for the MUC series of Information Extraction tasks (see Chapter 3) generally employed short-range Finite State matchers that provided eventual semantic-like output more quickly and reliably than purely syntax-based parsers. However, they did not produce a rich enough syntactic structure to support discourse processing such as co-reference, which imposed a limit on their overall performance. The goal being sought today is a combination of linguistic and statistical approaches that will robustly provide rich linguistic annotation of raw text.

The issues involved in developing more traditional rule-based grammar resources were thoroughly addressed in a 1996 report commissioned by the National Science Foundation (see http://www.cse.ogi.edu/CSLU/HLTsurvey/HLTsurvey.html, whose Chapter 3 covers grammars specifically). In addition, advances during the last two years have resulted in significant, measurable progress in broad-coverage parsing accuracy. Statistical learning techniques have led to the development of a new generation of accurate and robust parsers which provide very useful analyses of newspaper-style documents, and noisier, but still usable, analyses in other, similar domains (Charniak, 1995; Collins, 1997; Ratnaparkhi, 1997; Srinivas, 1997). Such parsers are trained on a set of (sentence, tree) pairs, and will then output the most likely parse for a new, unseen sentence.

One advantage of statistical methods is their ability to learn the grammar of the language automatically from training examples. Thus the human effort shifts from handcrafting a grammar to annotating a corpus of training examples. Human annotation can immediately provide coverage for phenomena outside the range of most handcrafted grammars, and the resulting corpus is a reusable resource which can be employed in the training of increasingly accurate generations of parsers as its annotations are enriched and technology progresses. Handcrafted grammars can still play an important role in the bootstrapping of appropriate grammatical structure, as illustrated by the role Fidditch (Hindle, 1983) played in the development of the Penn TreeBank (Marcus, 1993), and by the success of the Supertagger (Joshi and Srinivas, 1994; Srinivas, 1997), developed from corpora to which XTAG parses had been assigned (XTAG, 1995).
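As a rough illustration of what it means to train a parser on (sentence, tree) pairs, the toy sketch below induces a probabilistic grammar by counting productions in a two-tree treebank. Real treebank parsers such as those cited above add lexicalization, smoothing, and efficient search; nothing here is their actual algorithm, only the basic shift from hand-written rules to rules estimated from annotated data.

```python
# Toy illustration of grammar induction from (sentence, tree) pairs: rule
# probabilities are estimated by counting productions in bracketed trees.
from collections import Counter, defaultdict

treebank = [
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
    "(S (NP (DT the) (NN cat)) (VP (VBD slept)))",
]

def parse_tree(text):
    """Parse a bracketed tree string into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    def read(i):
        assert tokens[i] == "("
        label = tokens[i + 1]
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = read(i)
                children.append(child)
            else:
                children.append(tokens[i])   # terminal (a word)
                i += 1
        return (label, children), i + 1
    tree, _ = read(0)
    return tree

def count_productions(tree, counts):
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if isinstance(c, tuple):
            count_productions(c, counts)

counts = Counter()
for bracketing in treebank:
    count_productions(parse_tree(bracketing), counts)

# Relative-frequency estimate: P(LHS -> RHS) = count(LHS -> RHS) / count(LHS)
lhs_totals = defaultdict(int)
for (lhs, rhs), n in counts.items():
    lhs_totals[lhs] += n
for (lhs, rhs), n in sorted(counts.items()):
    print(f"{lhs} -> {' '.join(rhs)}   p = {n / lhs_totals[lhs]:.2f}")
```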
The next major advance has to come from a closer integration of syntax and lexical semantics, namely the ability to train these parsers to recognize not just syntactic structures, but structures that are rich with semantic content as well (Hermjakob and Mooney, 1997). In the same way that the existence of the Penn TreeBank enabled the development of extremely powerful new syntactic analysis methods, moving to the stage of lexical semantics will require a correspondingly richer level of annotation that includes word senses, predicate argument structure, noun-phrase semantic categories, and coreference.
In order to produce such a resource and, perhaps more importantly, to utilize it effectively, we need to team our parsing technologies more closely with lexical resources. This is an important part of the motivation behind lexicalized grammars such as TAG (Joshi, Levy, and Takahashi, 1975; Joshi, 1985) and CCG (Steedman, 1996). Tightly interwoven syntactic and semantic processing can provide the levels of accuracy that are required to support discourse analysis and inference and reasoning, which form the foundation of any natural language processing application. This has important implications for the future directions of corpora, lexicons, and ontologies as resources.
1.3.3 Lexicons
Lexicons are the heart of any natural language processing system. They include the vocabulary that the system can handle, both individual lexical items and multi-word phrases, with associated morphological, syntactic, semantic, and pragmatic information. In spoken language systems, they also include pronunciation and phonological information. In machine translation systems, bilingual and multilingual lexicons provide the basis for mapping from the source language to the target language. The EAGLES report on monolingual lexicons in several languages, http://www.ilc.pi.cnr.it/EAGLES96/rep2/rep2.html, gives a comprehensive description of how morphological and syntactic information should be encoded. Available on-line lexicons for English such as Comlex (Grishman et al., 1994) and XTAG to a large degree satisfy these guidelines, as do the SIMPLE lexicons being built in Europe for other languages. The EAGLES working group on Lexical Semantics is preparing guidelines for the encoding of semantic information.

However, to date these guidelines have not addressed the issue of making sense distinctions. How does the lexicon creator decide whether to make one, two, or more separate entries for the same lexeme? An issue of major concern is the current proliferation of different English lexicons in the computational linguistics community. Several on-line lexical resources that make sense distinctions are in use, Longman's, Oxford University Press (OUP), Cambridge University Press (CUP), Webster's, and WordNet, to name just a few, and they each take very different approaches. In SENSEVAL, the training data and test data were prepared using a set of OUP senses. In order to allow systems using WordNet to compete as well, a mapping from the OUP senses to the WordNet senses was made. The WordNet system builders commented that "OUP and WordNet carve up the world in different ways. It's possible that WordNet is more fine-grained in some instances, but in the map for the words in SENSEVAL, the OUP grain was generally finer (about 240 WN entries for the SENSEVAL words and about 420 OUP entries.) More than anything, the grain is not necessarily uniform -- not within WordNet, not within OUP." This is true of dictionaries in general. They make different decisions about how to structure entries for the same words, decisions which are all equally valid but simply not compatible. There was quite a bit of concern expressed, both at the workshop and afterwards, that this makes it impossible to create performance-preserving mappings between dictionaries.
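The toy sketch below, with invented sense identifiers, shows in miniature why such mappings are problematic: when several senses of one inventory map onto a single sense of another, annotations and scores cannot be converted back without losing distinctions. It is only an illustration of the many-to-one problem, not a model of the actual OUP-to-WordNet mapping.

```python
# Invented sense identifiers illustrating why mappings between two sense
# inventories (here, a coarse one and a finer-grained one) are lossy.

# Hypothetical entries for the noun "bank" in two inventories.
inventory_a = ["bank%a1 (financial institution)", "bank%a2 (river side)"]
inventory_b = ["bank%b1 (company)", "bank%b2 (building)", "bank%b3 (sloping ground)"]

# A hand-built mapping from inventory B to inventory A (many-to-one).
b_to_a = {
    "bank%b1 (company)":        "bank%a1 (financial institution)",
    "bank%b2 (building)":       "bank%a1 (financial institution)",
    "bank%b3 (sloping ground)": "bank%a2 (river side)",
}

# The reverse direction is one-to-many, so a system answering in inventory A
# cannot be scored against a gold standard in inventory B without ambiguity.
a_to_b = {}
for b_sense, a_sense in b_to_a.items():
    a_to_b.setdefault(a_sense, []).append(b_sense)

for a_sense, b_senses in a_to_b.items():
    print(a_sense, "<-", b_senses)
```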
This incompatibility has consequences that are far more widespread than the comparison of word sense disambiguation systems. Sense inventories, or lexicons, are the core of an information processing application; they are critical, and they are among the most labor-intensive components to build. Many existing natural language processing applications are described as domain-specific, and this primarily describes the lexicon being used, which contains the domain-specific senses for the vocabulary that is relevant to that application. Because of this incompatibility, it is very unlikely that lexicons from two different applications could be readily merged to create a new system with greater range and flexibility. The task of merging the lexicons could be just as labor intensive as the task of building them in the first place. Even more sweeping is the impact on multilingual information processing. All of these tasks require bilingual lexicons that map from English to French or German or Japanese. Many such bilingual lexicons are currently being built, but they all map to different English lexicons which are themselves incompatible. The problem of merging two different domain-specific English-French bilingual lexicons is an order of magnitude larger than the problem of merging two English domain-specific lexicons. Integrating a bilingual lexicon for a third language, such as Korean, that was mapped to yet another incompatible English lexicon requires doing the work all over again. The sooner we can regularize our representation of English computational lexicons, the less work we will have to do in the future.
Regularizing the English computational lexicon is not a trivial task. Creating a consensus on grammatical structure for the TreeBank required posting guidelines that described literally hundreds of distinct grammatical structures. Where lexical entries are concerned, the numbers are in the hundreds of thousands. The first step is simply agreeing on criteria for deciding when two different usages should be considered separate senses and when they should not, and whether that decision should be allowed to change depending on the context. Once these general principles have been determined, the business of revising one of the existing on-line lexicons, preferably WordNet since it is the most widely used, can begin. Only when the criteria for sense distinctions have been agreed upon can we create reliable sense-tagged corpora for machine learning purposes, and move our information processing systems on to the next critical stage.
Lexicon Development
There is increased recognition of the vital role played by lexicons (word lists with associated information) in fine-tuning general systems to particular domains.
Due to the extremely fluid and ever-changing nature of language, lexicon development poses an especially difficult challenge. No static resource can ever be adequate. In addition, as soon as large-scale generic lexicons with different layers of encoded information (morphological, syntactic, semantic, etc.) are created, they will still need to be fine-tuned for use in specific applications.
Generic and domain-specific lexicons are mutually interdependent. This makes it vital, for any sound lexicon development strategy, to accompany core static lexicons with dynamic means of enriching and integrating them, possibly on the fly, with many types of information. This global view eliminates the apparent dichotomy between static and dynamically built (or incremental) resources, encompassing the two approaches in a more comprehensive perspective that sees them as complementary and equally necessary facets of the same problem. In the past few years, steps towards this objective have been taken by a considerable number of groups all over the world, with many varied research and development efforts aimed at acquiring linguistic and, more specifically, lexical information from corpora. Among the EC projects working in this direction we mention LE SPARKLE (combining shallow parsing and lexical acquisition techniques capable of learning aspects of word knowledge needed for LE applications) and LE ECRAN.
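A minimal sketch of the corpus-driven side of this picture appears below: words that are frequent in a domain corpus but absent from a static core lexicon are proposed as candidate entries for later classification. The corpus, lexicon, and frequency threshold are all invented; real acquisition systems of the kind developed in SPARKLE and ECRAN use far richer linguistic analysis.

```python
# A minimal sketch of enriching a static core lexicon from a domain corpus:
# frequent corpus words missing from the lexicon become candidate entries for
# later (semi-automatic) classification.  Corpus and lexicon are invented.
from collections import Counter

core_lexicon = {"the", "of", "protein", "cell", "bind", "to", "a", "is"}

domain_corpus = """
the kinase binds to the receptor
the receptor phosphorylates the kinase
a kinase is a protein
""".split()

frequencies = Counter(word.lower() for word in domain_corpus)

# Propose candidate lexical entries: frequent enough, and not already covered.
MIN_FREQ = 2
candidates = {
    word: freq
    for word, freq in frequencies.items()
    if freq >= MIN_FREQ and word not in core_lexicon
}

for word, freq in sorted(candidates.items(), key=lambda kv: -kv[1]):
    print(f"candidate entry: {word}  (corpus frequency {freq})")
```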
Gaps in Static Lexicons
As Gross pointed out as early as the 1970s (Gross, 1984), most existing lexicons contain simple words, while actually occurring texts such as newspapers are composed predominantly of multi-word phrases. The phrasal nature of the lexicon has still not been addressed properly, and this remains a major limitation of available resources. Correcting it will require corpora to play a major role, but also methodologies for extraction and linguistic methods of classification.
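One simple, corpus-based way to begin addressing the phrasal gap is sketched below: adjacent word pairs are scored with pointwise mutual information (PMI), and high-scoring, frequent pairs are proposed as multi-word candidates for a lexicographer to classify. The corpus is a toy example and the threshold is arbitrary; this is an illustration of the general technique, not of any particular project's extraction methodology.

```python
# Sketch of proposing multi-word lexical entries from a corpus by scoring
# adjacent word pairs with pointwise mutual information (PMI).  Candidates
# would still need linguistic classification before entering a lexicon.
import math
from collections import Counter

corpus = (
    "the prime minister met the press . "
    "the prime minister resigned . "
    "a press conference followed the press release ."
).split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def pmi(pair):
    w1, w2 = pair
    p_pair = bigrams[pair] / (total - 1)
    return math.log2(p_pair / ((unigrams[w1] / total) * (unigrams[w2] / total)))

candidates = [
    (pair, pmi(pair))
    for pair, freq in bigrams.items()
    if freq >= 2 and "." not in pair            # frequency cutoff, skip punctuation
]
for pair, score in sorted(candidates, key=lambda x: -x[1]):
    print(" ".join(pair), f"PMI = {score:.2f}")
```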
As mentioned above, resources for evaluation and the evaluation of resources remain a major open problem in lexicon development, validation, and reuse.
Large morphosyntactically annotated corpora exist for many European languages, built for example in MULTEXT and, for all the EU languages, in PAROLE, and the production of large syntactically annotated corpora has started for some EU languages; semantically tagged corpora, however, do not yet exist. These are rapidly becoming a major requirement for developing application-specific tools.
Critical Priorities in Lexicon Development
Computational lexicons, like human dictionaries, often represent a sort of stereotypical, theoretical language. Carefully constructed or selected large corpora are essential sources of linguistic knowledge for the extensive description of the concrete use of the language in real text. To be usable and practical, a computational lexicon has to faithfully represent the apparently irregular facts (evidenced by corpus analysis) and the divergences of actual usage from what is in theory acceptable. We need to clearly represent, and separate, what is allowed but only very rarely instantiated from what is both allowed and actually used. To this end, more robust and flexible tools are needed for the (semi-)automatic induction of linguistic knowledge from texts. This usually implies a bootstrapping method, because extraction presupposes some capability of automatically analyzing the raw text in various ways, which in turn requires a lexicon. The induction phase must however be followed by a linguistic analysis and classification phase if the induced data is to be used and merged with already available resources. Therefore:
The EC-funded projects provide an excellent framework for facilitating these types of interactions, by providing the necessary funding for combining the efforts of different and complementary groups. This complementarity of existing competence should continue to be sought and carefully planned.
1.3.4 Ontologies
Background
As described in
Chapters 2, 3, and 4, semantic information is central to improving the performance of Language Processing systems. Lexical semantic information such as semantic class constraints, thematic roles, and lexical classifications needs to be closely coupled to the semantic frameworks used for language processing. Increasingly, such information is represented and stored in so-called ontologies.

An ontology can be viewed as an inventory of concepts, organized under some internal structuring principle. Ontologies go back to Aristotle; more recently (in 1852), Peter Mark Roget published his Thesaurus of English Words and Phrases Classified and Arranged so as to Facilitate the Expression of Ideas and Assist in Literary Composition. The organization of the words in a thesaurus follows the organization of the concepts that the words express, and not vice versa as in a dictionary; a thesaurus can therefore be considered to be an ontology. Roget's thesaurus has been revised (Chapman, 1977), but not significantly altered. For computational purposes, however, a consistently structured ontology is needed for automatic processing, over and above what Roget provides.
This definition of an ontology as an inventory of concepts, however, raises the notoriously difficult question: What is a concept? Over the past decade, two principal schools of thought have emerged on this question. Researchers in Language Processing circles have typically simplified the answer by equating concept with lexicalized concept, i.e., a concept that is expressed by one or more words of a language. (The assumption that more than one word may refer to the same concept reflects the familiar phenomenon of synonymy.) Under this view, an ontology is the inventory of word senses of a language, its semantic lexicon. This definition has the advantage that it contains only those concepts that are shared by a linguistic community. It excludes possible concepts like my third cousin's black cat, which are idiosyncratic to a given speaker and of no interest to psychologists, philosophers, linguists, etc. Relating one's ontology to the lexicon also excludes potential concepts expressible by ad-hoc compounds like paper clip container, which can be generated on the fly but are not part of the core inventory, as their absence from dictionaries shows. Moreover, we avoid the need to define what counts as a word by limiting our inventory to those strings found in standard lexical reference works. Thus the Language Processing ontologies that have been built resemble Roget's thesaurus in that they express the relationships among concepts at the granularity of words. Within Artificial Intelligence (AI), in contrast, a concept has roughly been identified with some abstract notion that facilitates reasoning (ideally, by a system and not just by the ontology builder), and the resulting ontologies have also been called Domain Models or Knowledge Bases. To differentiate the two styles, the former are often referred to as terminological ontologies (or even just term taxonomies), while the latter are sometimes called conceptual or axiomatized ontologies.
The purpose of terminological ontologies is to support Language Processing. Typically, the content of these ontologies is relatively meager, with only a handful of relationships, on average, between any given concept and all the others. Neither the concepts nor the inter-concept relationships are formally defined; they are typically differentiated only by name and possibly a textual definition. The core structuring relationship is usually called is-a and expresses the rough notion of "a kind of", or conceptual generalization. Very often, to support the wide range of language, terminological ontologies contain over 100,000 entities, and they tend to be linked to lexicons of one or more languages that provide the words expressing the concepts. The best-known example of a terminological ontology is WordNet (Miller, 1990; Fellbaum, 1998), which as an on-line reference resource has had a major impact on the ability of researchers to conceive of different semantic processing techniques. However, before the collection of truly representative large-scale sets of semantic senses can begin, the field has to develop a clear consensus on guidelines for computational lexicons. Attempts are indeed being made, including (Melcuk, 1988; Pustejovsky, 1995; Nirenburg et al., 1992; Copestake and Sanfilippo, 1993; Lowe et al., 1997; Dorr, 1997; Palmer, 1998). Other terminological ontologies are Mikrokosmos (Viegas et al., 1996), used for machine translation, and SENSUS (Knight and Luk, 1994; Hovy, 1998), used for machine translation of several languages, text summarization, and text generation.
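The sketch below shows, in miniature, the shape of such a terminological ontology: concepts linked by is-a, each linked to the words that lexicalize it, with a crude generalization check of the kind that supports disambiguation. The concept names, links, and languages are invented and vastly sparser than WordNet, Mikrokosmos, or SENSUS.

```python
# A tiny, invented terminological ontology: concepts carry only a name, an
# is-a link, and the words that lexicalize them -- far sparser than WordNet
# or SENSUS, but the same basic shape.

is_a = {
    "canine":        "mammal",
    "mammal":        "animal",
    "animal":        "living-thing",
    "living-thing":  "entity",
}

lexicalizations = {
    "canine": {"en": ["dog"], "es": ["perro"]},
    "mammal": {"en": ["mammal"], "es": ["mamífero"]},
}

def ancestors(concept):
    """Follow is-a links up to the root, returning the generalization chain."""
    chain = []
    while concept in is_a:
        concept = is_a[concept]
        chain.append(concept)
    return chain

print("canine ->", ancestors("canine"))
# A crude selectional check of the kind used for disambiguation: is the
# concept expressed by a word a kind of "animal"?
print("canine is an animal:", "animal" in ancestors("canine"))
```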
In contrast, the conceptual ontologies of AI are built to support logic-based inference, and often include substantial amounts of world knowledge in addition to lexical knowledge. Thus the content of each concept is usually richer, involving some dozens or even more axioms relating a concept to others (for example, a car has-part wheels, the usual number of wheels being 4, the wheels enabling motion, and so on). Often, conceptual ontologies contain candidates for concepts for which no word exists, such as PartiallyTemporalAndPartiallySpatialThing. Recent conceptual ontologies reflect a growing understanding that two core structuring relationships are necessary to express logical differences in generalization, and that concepts exhibit various facets (structural, functional, meronymic, material, social, and so on). Thus a glass, under the material facet, is a lot of glass matter; under the meronymic facet, it is a configuration of stem, foot, and bowl; under the functional facet, it is a container from which one can drink and through which one can see; under one social facet, it is the object that the bridegroom crushes at a wedding; see (Guarino, 1997). Given the complex analysis required to build such models, and the interrelationships among concepts, conceptual ontologies tend to number between 2,000 and 5,000 entities. The largest conceptual ontology, CYC (Lenat and Guha, 1990), contains approx. 40,000 concepts; every other conceptual ontology is an order of magnitude smaller. (In contrast, as mentioned above, WordNet has roughly 100,000 concepts.) Unfortunately, given the complexity of these ontologies, internal logical consistency is an ongoing and serious problem.
Ontologies contain the semantic information that enables Language Processing systems to deliver higher quality performance. They help with a large variety of tasks, including word sense disambiguation (in "he picked up the bench", "bench" cannot stand for the judiciary because the judiciary is an abstraction), phrase attachment (in "he saw the man with the telescope", it is more likely that the telescope was used to see the man than that it is something uniquely associated with the man, because it is an instrument for looking with), and machine translation (as an inventory of the symbols via which words in different languages can be associated). The obvious need for ontologies, coupled with the current lack of many large examples, leads to the vexing question of exactly how to build useful multi-purpose ontologies.
Unfortunately, ontologies are difficult and expensive to build. To be useful, they have to be large and comprehensive. Therefore, the more an ontology can be shared by multiple applications, the more useful it is. However, it is not so much a matter of designing the right ontology (an almost meaningless notion, given our current lack of understanding of semantics), but of having a reasonable one that can serve pressing purposes, and on which some consensus between different groups can be reached. In this light, creating a consensus ontology becomes a worthwhile enterprise; indeed, this is precisely the goal of the ANSI group on Ontology Standards (Hovy, 1998), and it is a critical task for the EAGLES Lexicon/Semantics Working Group. Initiatives of this kind must converge and act in synergy to be fruitful for the Language Processing community.
Open Questions in Language Processing Ontologies
WordNet (Miller, 1990; Fellbaum, 1998) is a lexical database organized around lexicalized concepts, or synonym sets. Unlike Roget's largely intuitive design, WordNet was originally motivated by psycholinguistic models of human semantic memory and knowledge representation. Supported by data from word association norms, WordNet links together its synonym sets (lexicalized concepts) by means of a small number of conceptual-semantic and lexical relations. The most important ones are hyponymy (the superclass relation) and meronymy (the part-whole relation) for concepts expressible by nouns, antonymy for adjectives, and several entailment relations for verbs (Miller, 1990; Fellbaum, 1998). Whereas WordNet is entirely hand-constructed, (Amsler, 1980) and (Chodorow et al., 1985) were among those who tried to extract hyponymically related words automatically from machine-readable dictionaries by exploiting their implicit structure. (Hearst, 1998) proposed finding semantically related words by locating specific phrase patterns in texts.
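The fragment below sketches the pattern-based idea attributed to Hearst (1998): lexical patterns such as "X such as Y" in running text suggest that Y names a kind of X. The regular expression and the sentence are greatly simplified (single lowercase nouns only) and are not taken from the original work.

```python
# Rough sketch of pattern-based extraction of hyponymy relations in the spirit
# of Hearst (1998): "X such as A, B and C" suggests A, B, and C are kinds of X.
import re

text = ("The collection contains instruments such as violins, flutes and drums. "
        "It also holds metals such as bronze and copper.")

# "X such as A, B and C"  ->  (A, X), (B, X), (C, X) as candidate hyponym pairs
pattern = re.compile(r"(\w+) such as ((?:\w+(?:, | and )?)+)")

for match in pattern.finditer(text):
    hypernym = match.group(1)
    for hyponym in re.split(r", | and ", match.group(2)):
        print(f"{hyponym}  is-a  {hypernym}")
```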
SENSUS (Knight and Luk, 1994; Hovy, 1998) is a derivative of WordNet that seeks to make it more amenable to the tasks of machine translation and text generation. The necessary alterations required the inclusion of a whole new top level of approx. 300 high-level abstractions of English syntax called the Upper Model (Bateman et al., 1989), as well as a concomitant retaxonomization of WordNet (separated into approx. 100 parts) under this top level. To enable machine translation, SENSUS concepts act as pivots between the words of different languages; they are linked to lexicons of Japanese, Spanish, Arabic, and English.
Mikrokosmos (Viegas et al., 1996; Mahesh, 1996) is an ontology of approx. 5,000 high-level abstractions, in terms of which lexical items are defined for a variety of languages, again for the task of machine translation.
The experience of designing and building these and other ontologies has brought out the same major difficulties in each case. First among these is the identification of the concepts. The top-level concepts in particular remain a source of controversy, because these very abstract notions are not always well lexicalized and can often be referred to only by phrases such as causal agent and physical object. In addition, concepts fall into distinct classes, expressible by different parts of speech: entities are referred to by nouns; functions, activities, events, and states tend to be expressed by verbs; and attributes and properties are lexicalized by adjectives. But some concepts do not follow this neat classification. Phrases and chunks such as "won't hear of it" and "the X-er the Y-er" (Fillmore et al., 1988; Jackendoff, 1995) arguably express specific concepts, but they cannot always be categorized either lexically or in terms of high-level concepts.
Second, the internal structure of proposed ontologies is controversial. WordNet relates all synonym sets by means of about a dozen semantic relations; (Melcuk, 1988) proposes over fifty. There is little solid evidence for the set of all and only useful relations, and intuition invariably comes into play. Moreover, it is difficult to avoid the inherent polysemy of semantic relations. For example, (Chaffin et al., 1988) analyzed the many different kinds of meronymy, and similar studies could be undertaken for hyponymy and antonymy (Cruse, 1986). Another problem is the fact that semantic relations like antonymy, entailment, meronymy, and class inclusion are themselves concepts, raising the question of circularity.
Two major approaches currently exist concerning the structure of ontologies. One approach identifies all elemental concepts as factors, and then uses concept lattices to represent all factor combinations under which concepts can be taxonomized (Wille, 1992). The more common approach is to taxonomize concepts using the concept generalization relation as structural principle. While the debate concerning the relative merits of both approaches continues, only the taxonomic approach has been empirically validated with the construction of ontologies containing over 10,000 concepts.
Third, a recurrent problem relates to the question of multiple inheritance. For example, a dog can be both an animal and a pet. How should this dual relation be represented in an ontology? WordNet and SENSUS treat both dog and pet as kinds of animals and ignore the type-role distinction, because it seems impossible to construct full hierarchies from role or function concepts such as pet. But clearly, there is a difference between these two kinds of concepts. Casting this problem as one of conceptual facets, Pustejovsky (1995) proposed a solution: a lexicon with underspecified entries such as newspaper, together with structured semantic information about the underlying concept. Depending on the context in which the word occurs, some of its semantic aspects are foregrounded while others are not needed for interpreting the context, e.g., the building vs. the institution aspects of newspaper. Guarino (1997) takes this approach a step further and identifies at least 8 so-called Identity Criteria that each express a different facet of conceptual identity. Such approaches may well offer a satisfactory solution for the representation of the meaning of complex concepts.
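One way to make the type-role distinction concrete is sketched below: each concept carries a single type parent for taxonomization plus separate role links, so that dog is a kind of animal while still being linked to pet. The structure is invented for illustration; it is not how WordNet, SENSUS, or the cited proposals actually represent the distinction.

```python
# Invented sketch of separating type parents from role links, so that "dog"
# is taxonomized as a kind of animal while still being linked to the role
# "pet".  This only illustrates the distinction discussed above.

concepts = {
    "dog":    {"type_parent": "animal", "roles": ["pet", "guard"]},
    "animal": {"type_parent": "organism", "roles": []},
    "pet":    {"type_parent": None, "roles": []},   # a role, not a natural kind
}

def is_a(concept, candidate):
    """True if candidate is reachable through type parents only."""
    while concept is not None:
        if concept == candidate:
            return True
        concept = concepts.get(concept, {}).get("type_parent")
    return False

def plays_role(concept, role):
    return role in concepts.get(concept, {}).get("roles", [])

print(is_a("dog", "animal"))      # True  (taxonomic generalization)
print(is_a("dog", "pet"))         # False (pet is not a type parent)
print(plays_role("dog", "pet"))   # True  (role link kept separate)
```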
Despite their quasi-semantic nature, ontologies based on lexicons do not map readily across languages. It is usually necessary to find shared concepts underlying the lexical classifications in order to facilitate multilingual mappings. Currently, the EuroWordNet project is building lexical databases in eight European languages patterned after WordNet but with several important enhancements (Vossen et al., 1999). EuroWordNet reveals crosslinguistic lexicalization patterns of concepts. Its interlingual index is the union of all lexicalized concepts in the eight languages, and it permits one to examine which concepts are expressed in all languages and which ones are matched with a word in only a subset of the languages, an important perspective for ontology theoreticians to gain. Multilingual applications are always a good test of one's theories. EuroWordNet is testing the validity of the original WordNet and the way it structures the concepts lexicalized in English. Crosslinguistic matching reveals lexical gaps in individual languages, as well as concepts that are particular to one language only. Eventually, an inspection of the lexicalized concepts shared by all eight member languages should be of interest, as well as of the union of the concepts of all the languages. Similar data should be available from the Mikrokosmos project. To yield a clear picture, ontologies from as wide a variety of languages as possible should be compared, and the coverage should be comparable for all languages.
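The toy sketch below illustrates the interlingual-index idea: language-independent concept identifiers are mapped to the synonym sets that lexicalize them in each language, and simple set operations then expose shared concepts and lexical gaps. The identifiers, languages, and entries are invented and tiny compared with EuroWordNet's eight languages.

```python
# Toy sketch of an interlingual index (ILI): language-independent concept ids
# mapped to the synonym sets that lexicalize them in each language.

ili = {
    "ILI-0001": {"en": ["finger"], "es": ["dedo"], "nl": ["vinger"]},
    "ILI-0002": {"en": ["toe"],    "es": ["dedo"], "nl": ["teen"]},
    "ILI-0003": {"en": [],         "es": ["sobremesa"], "nl": []},   # lexical gap in en/nl
}

languages = ["en", "es", "nl"]

shared = [ili_id for ili_id, synsets in ili.items()
          if all(synsets.get(lang) for lang in languages)]
gaps = {lang: [ili_id for ili_id, synsets in ili.items() if not synsets.get(lang)]
        for lang in languages}

print("concepts lexicalized in all languages:", shared)
print("lexical gaps per language:", gaps)
```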
The Special Challenge of Verbs
It is not surprising that WordNet's noun classification has been used more successfully than its verb classification, or that the majority of the entries in the Generative Lexicon are nouns. By their very nature, verbs involve multiple relationships among many participants, which can themselves be complex predicates. Classifying and comparing such rich representations is especially difficult.
An encouraging recent development in linguistics provides verb classifications that have a more semantic orientation (Levin, 1993; Rappaport Hovav and Levin, 1998). These classes, and refinements of them (Dang et al., 1998; Dorr, 1997), provide the key to making generalizations about regular extensions of verb meanings, which is critical to building the bridge between syntax and semantics. Based on these results, a distributional analysis of properly disambiguated syntactic frames should provide critical information regarding a verb's semantic classification, as is currently being explored by (Stevenson and Merlo, 1997). This could make it possible to use the syntactic frames occurring with particular lexical items in large parsed corpora to automatically form clusters that are both semantically and syntactically coherent. This is our doorway, not just to richer computational lexicons, but to a methodology for building ontologies. The more we can rely on semi-automated and automated methods for building classifications, even those tailored to specific domains, the more objective these classifications will be, and the more reliably they will port to other languages.
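A crude sketch of this distributional idea follows: each verb is represented as a vector of counts over the syntactic frames it occurs with in a parsed corpus, and verbs are compared by cosine similarity so that similar distributions cluster together. The frame labels and counts are invented, and real work of the kind cited above uses richer features and proper clustering algorithms.

```python
# Sketch of grouping verbs by the distribution of syntactic frames they occur
# with.  Frame counts are invented; this is only a stand-in for the corpus-based
# verb classification discussed above.
import math

# Counts of (hypothetical) frames observed for each verb in a parsed corpus.
frame_counts = {
    "break":   {"NP-V": 30, "NP-V-NP": 55, "NP-V-PP": 5},
    "shatter": {"NP-V": 25, "NP-V-NP": 60, "NP-V-PP": 4},
    "give":    {"NP-V": 2,  "NP-V-NP": 40, "NP-V-NP-PP": 50},
}

frames = sorted({f for counts in frame_counts.values() for f in counts})

def vector(verb):
    return [frame_counts[verb].get(f, 0) for f in frames]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

verbs = list(frame_counts)
for i, v1 in enumerate(verbs):
    for v2 in verbs[i + 1:]:
        print(f"{v1:8s} ~ {v2:8s}: {cosine(vector(v1), vector(v2)):.2f}")
```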
Recent encouraging results in the application of statistical techniques to lexical semantics lend credence to this notion. There have been surprising breakthroughs in the use of lexical resources for semantic analysis in the areas of homonym disambiguation (Yarowsky, 1995) and prepositional phrase attachment (Stetina and Nagao, 1997). There are also new clustering algorithms that create word classes corresponding to linguistic concepts or that aid in language modeling tasks (Resnik, 1993; Lee et al., 1997). New research projects exploring the application of linguistic theories to verb representation, such as FRAMENET (Lowe et al., 1997) and VERBNET (Dang et al., 1998), promise to advance our understanding of computational lexicons. The next few years should bring dramatic changes to our ability to use and represent lexical semantics.
The Future
Ontologies are no longer of interest to philosophers only, but also to linguists, computer scientists, and people working in information and library sciences. Creating an ontology is an attempt to represent human knowledge in a structured way. As more and more knowledge, expressed by words and documents, is available to larger numbers of people, it needs to be made accessible easily and quickly. Ontologies permit one to efficiently store and retrieve great amounts of data by imposing a classification and structure on the knowledge in these data.
There is only one way to make effective progress on this difficult and important question. Instead of re-building ontologies anew for each domain and each application, the existing ontologies must be pooled, converted to the same notation, and cross-indexed, and one or more common, standardized, and maximally extensive ontologies should be created. The semi-automated cross-ontology alignment work reported in (Knight and Luk, 1994; Agirre et al., 1994; Rigau and Agirre, 1995; Hovy, 1996; Hovy, 1998) illustrates the extent to which techniques can be developed to exploit the ontology structure, concept names, and concept definitions.
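A minimal sketch of one such alignment heuristic appears below: candidate concept pairs across two ontologies are scored by lexical overlap between their names and textual definitions. The concepts, definitions, and scoring are invented and far simpler than the semi-automated alignment systems cited above, which also exploit the ontologies' structure.

```python
# Minimal sketch of aligning concepts across two ontologies by lexical overlap
# between names and definitions.  Concepts and definitions are invented.

ontology_a = {
    "automobile": "a road vehicle with four wheels powered by an engine",
    "canine":     "a domesticated carnivorous mammal kept as a pet",
}
ontology_b = {
    "car":  "a four-wheeled road vehicle driven by an engine",
    "dog":  "a domesticated mammal often kept as a pet or for work",
    "tree": "a tall plant with a trunk and branches",
}

def words(text):
    return set(text.lower().split())

def overlap_score(name_a, def_a, name_b, def_b):
    """Jaccard overlap of the combined name + definition word sets."""
    a, b = words(name_a) | words(def_a), words(name_b) | words(def_b)
    return len(a & b) / len(a | b)

for name_a, def_a in ontology_a.items():
    best = max(ontology_b.items(),
               key=lambda item: overlap_score(name_a, def_a, item[0], item[1]))
    score = overlap_score(name_a, def_a, best[0], best[1])
    print(f"{name_a}  <->  {best[0]}   (score {score:.2f})")
```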
If this goal, shared by a number of enterprises, including the ANSI Ad Hoc Committee on Ontology Standardization (Hovy, 1998), can indeed be realized, it will constitute a significant advance for semantic-based Language Processing.
1.4 Conclusion
The questions raised here are likely to continue to challenge us in the near future; after all, ontologies have occupied people's minds for over 2,500 years. Progress and understanding are likely to come not from mere speculation and theorizing, but from the construction of realistically sized models such as WordNet, CYC (Lenat and Guha, 1990), Mikrokosmos (Viegas et al., 1996; Mahesh, 1996), and SENSUS, the ISI multilingual ontology (Knight and Luk, 1994; Hovy, 1998).
The next major technical advance is almost certain to come from a closer integration of syntax and lexical semantics, most probably via the ability to train statistical parsers to recognize not just syntactic structures, but structures that are rich with semantic content as well. In the same way that the existence of the Penn TreeBank enabled the development of extremely powerful new syntactic analysis methods, the development of a large resource of lexical semantics (either in the form of an ontology or a semantic lexicon) will facilitate a whole new level of processing. Construction of such a semantic resource requires corpora with a correspondingly richer level of annotation, including word senses, predicate argument structure, noun-phrase semantic categories, and coreference, as well as multilingual lexicons rich in semantic structure that are coupled to multilingual ontologies. Tightly interwoven syntactic and semantic processing can provide the levels of accuracy that are required to support discourse analysis and inference and reasoning, the foundation of any natural language processing application.
The thesis of this chapter is that language resources play an essential role in the infrastructure of Language Processing, as the necessary common platform on which new technologies and applications must be based. In order to avoid massive and wasteful duplication of effort, at least partial public funding of language resource development is critical to ensure public availability (although not necessarily at no cost). A prerequisite to such a publicly funded effort is careful consideration of the needs of the community, in particular the needs of industry. In a multilingual setting such as today's global economy, the need for standards is even stronger. In addition to the other motivations for designing common guidelines, there is the need for common specifications so that compatible and harmonized resources for different languages can be built. Finally, clearly defined and agreed-upon standards and evaluations will encourage the widespread adoption of resources, and the more they are used, the greater the possibility that the user community will be willing to contribute to further maintenance and development.
1.5 References
Agirre, E., X. Arregi, X. Artola, A. Diaz de Ilarazza, K. Sarasola. 1994. Conceptual Distance and Automatic Spelling Correction. Proceedings of the Workshop on Computational Linguistics for Speech and Handwriting Recognition. Leeds, England.
Amsler, R.A. 1980. The Structure of the Merriam-Webster Pocket Dictionary. Ph.D. dissertation in Computer Science, University of Texas, Austin, TX
Bateman, J.A., R.T. Kasper, J.D. Moore, and R.A. Whitney. 1989. A General Organization of Knowledge for Natural Language Processing: The Penman Upper Model. Unpublished research report, USC/Information Sciences Institute, Marina del Rey, CA.
Chaffin, R., D.J. Herrmann, and M. Winston. 1988. A taxonomy of part-whole relations: Effects of part-whole relation type on relation naming and relations identification. Cognition and Language 3 (132).
Chapman, R. 1977. Roget's International Thesaurus, Fourth Edition. New York: Harper and Row.
Charniak, E. 1995. Parsing with Context-Free Grammars and Word Statistics. Technical Report: CS-95-28, Brown University.
Chodorow, M., R. Byrd, and G. Heidorn. 1985. Extracting semantic hierarchies from a large on-line dictionary. Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics (299–304).
Collins, M. 1997. Three generative, lexicalised models for statistical parsing. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics. Madrid, Spain.
Copestake, A. and A. Sanfilippo. 1993. Multilingual lexical representation. Proceedings of the AAAI Spring Symposium: Building Lexicons for Machine Translation. Stanford University, California.
Cruse, D.A. 1986. Lexical Semantics. Cambridge: Cambridge University Press.
Dang, H., K. Kipper, M. Palmer, and J. Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. Proceedings of ACL98. Montreal, Canada.
Dorr, B. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation 12 (155).
Fellbaum, C. 1998. (ed.) WordNet: An On-Line Lexical Database and Some of its Applications. Cambridge, MA: MIT Press
Fillmore, C., P. Kay, and C. O'Connor. 1988. Regularity and idiomaticity in grammatical construction. Language 64 (501–568).
Grishman, R., Macleod C., and Meyers, A. 1994. Comlex Syntax: Building a Computational Lexicon, Proc. 15th Int'l Conf. Computational Linguistics (COLING 94), Kyoto, Japan, August.
Gross, M. 1984. Lexicon-Grammar and the Syntactic Analysis of French. Proceedings of the 10th International Conference on Computational Linguistics (COLING'84), Stanford, California.
Guarino, N. 1997. Some Organizing Principles for a Unified Top-Level Ontology. New version of paper presented at AAAI Spring Symposium on Ontological Engineering, Stanford University, March 1997.
Hearst, M. 1998. Automatic Discovery of WordNet Relations. In C. Fellbaum (ed.), WordNet: An On-Line Lexical Database and Some of its Applications (131–151). Cambridge, MA: MIT Press.
Hermjakob, U. and R.J. Mooney. 1997. Learning Parse and Translation Decisions from Examples with Rich Context. Proceedings of the ACL/EACL Conference. Madrid, Spain (482–487).
Hindle, D. 1983. User manual for Fidditch. Technical memorandum 7590-142, Naval Research Laboratory.
Hovy, E.H. 1996. Semi-Automated Alignment of Top Regions of SENSUS and CYC. Presented to ANSI Ad Hoc Committee on Ontology Standardization. Stanford University, Palo Alto, September 1996.
Hovy, E.H. 1998. Combining and Standardizing Large-Scale, Practical Ontologies for Machine Translation and Other Uses. Proceedings of the First International Conference on Language Resources and Evaluation (LREC). Granada, Spain.
Jackendoff, R. 1995. The Boundaries of the Lexicon. In M. Everaert, E.J. van den Linden, A. Schenk, and R. Schreuder, (eds), Idioms: Structural and Psychological Perspectives. Hillsdale, NJ: Erlbaum Associates.
Joshi, A.K. 1985. Tree Adjoining Grammars: How much context sensitivity is required to provide a reasonable structural description. In D. Dowty, L. Karttunen, and A. Zwicky (eds), Natural Language Parsing (206–250). Cambridge: Cambridge University Press.
Joshi, A., L. Levy, and M. Takahashi. 1975. Tree Adjunct Grammars. Journal of Computer and System Sciences.
Joshi, A.K. and B. Srinivas. 1994. Disambiguation of Super Parts of Speech (or Supertags): Almost Parsing, Proceedings of the 17th International Conference on Computational Linguistics (COLING-94). Kyoto, Japan.
Kilgarriff, A. 1997. Evaluating word sense disambiguation programs: Progress report. Proceedings of the SALT Workshop on Evaluation in Speech and Language Technology. Sheffield, U.K.
Knight, K. and S.K. Luk. 1994. Building a Large-Scale Knowledge Base for Machine Translation. In Proceedings of the AAAI Conference.
Lenat, D.B. and R.V. Guha. 1990. Building Large Knowledge-Based Systems. Reading: Addison-Wesley.
Lowe, J.B., C.F. Baker, and C.J. Fillmore. 1997. A frame-semantic approach to semantic annotation. Proceedings 1997 Siglex Workshop, ANLP97. Washington, D.C.
Mahesh, K. 1996. Ontology Development for Machine Translation: Ideology and Methodology. New Mexico State University CRL report MCCS-96-292.
Marcus, M., B. Santorini, and M. Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics Journal, Vol. 19.
Melcuk, I. 1988. Semantic description of lexical units in an explanatory combinatorial dictionary: Basic principles and heuristic criteria. International Journal of Lexicography (165–188).
Miller, G.A. 1990. (ed.). WordNet: An on-line lexical database. International Journal of Lexicography 3(4) (235–312).
Nirenburg, S., J. Carbonell, M. Tomita, and K. Goodman. 1992. Machine Translation: A Knowledge-Based Approach. San Mateo: Morgan Kaufmann.
Lee, L., Dagan, I. and Pereira, F. 1997. Similarity-based methods for word sense disambiguation. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics. Madrid, Spain.
Palmer, M. 1998. Are WordNet sense distinctions appropriate for computational lexicons? Proceedings of Senseval, Siglex98. Brighton, England.
Pustejovsky, J. 1995. The Generative Lexicon. MIT Press
Rappaport Hovav, M. and B. Levin. 1998. Building Verb Meanings. In M. Butt and W. Geuder (eds.), The Projection of Arguments. Stanford, CA: CSLI Publications.
Ratnaparkhi, A. 1997. A Linear Observed Time Statistical Parser Based on Maximum Entropy Models. Proceedings of the Second Conference on Empirical Methods in Natural Language Processing.
Resnik, P. 1993. Selection and Information: A Class-Based Approach to Lexical Relationships. Ph.D. thesis, University of Pennsylvania, Department of Computer and Information Sciences.
Rigau, G. and E. Agirre. 1995. Disambiguating Bilingual Nominal Entries against WordNet. Proceedings of the 7th ESSLI Symposium. Barcelona, Spain.
Srinivas, B. 1997. Performance Evaluation of Supertagging for Partial Parsing. Proceedings of Fifth International Workshop on Parsing Technology, Boston.
Steedman, M. 1996. Surface Structure and Interpretation. Cambridge, MA: MIT Press.
Stetina, J. and M. Nagao. 1997. Corpus Based PP Attachment Ambiguity Resolution with a Semantic Dictionary. Proceedings of the Fifth Workshop on Very Large Corpora (66–80). Beijing and Hong Kong.
Stevenson, S. and P. Merlo. 1997. Lexical structure and parsing complexity. Language and Cognitive Processes 12(2/3) (349–399).
Viegas, E., K. Mahesh, and S. Nirenburg. 1996. Semantics in Action. Proceedings of the Workshop on Predicative Forms in Natural Language and in Knowledge Bases (108–115). Toulouse, France.
Vossen, P., et al. 1999. EuroWordNet. Computers and the Humanities, special issue (in press).
Wille, R. 1992. Concept lattices and conceptual knowledge systems. Computers and Mathematics with Applications 23 (493–515).
The XTAG-Group. 1995. A Lexicalized Tree Adjoining Grammar for English. Technical Report IRCS 95-03, University of Pennsylvania. Updated version available at
http://www.cis.upenn.edu/xtag/tr/tech-report.html.
Yarowsky, D. 1995. Three Machine Learning Algorithms for Lexical Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania, Department of Computer and Information Sciences.