



Jean Véronis†‡ and Nancy Ide†
†Department of Computer Science, VASSAR COLLEGE, Poughkeepsie, New York 12601 (U.S.A.)
‡Groupe Représentation et Traitement des Connaissances, CENTRE NATIONAL DE LA RECHERCHE SCIENTIFIQUE, 31, Ch. Joseph Aiguier, 13402 Marseille Cedex 09 (France)
In this paper we provide a quantitative evaluation of information automatically extracted from machine readable dictionaries. Our results show that for any one dictionary, 55-70% of the extracted information is garbled in some way. However, we show that this error rate can be dramatically reduced to about 6% by combining the information extracted from five dictionaries. It therefore appears that even if individual dictionaries are an unreliable source of semantic information, multiple dictionaries can play an important role in building large lexical-semantic databases.
In recent years, it has become increasingly clear that the limited size of existing computational lexicons and the poverty of the semantic information they contain represent one of the primary bottlenecks in the development of realistic natural language processing (NLP) systems. The need for extensive lexical and semantic databases is evident in the recent initiation of a number of projects to construct massive generic lexicons for NLP (project GENELEX in Europe or EDR in Japan). The manual construction of large lexical-semantic databases demands enormous human resources, and there is a growing body of research into the possibility of automatically extracting at least a part of the required lexical and semantic information from everyday dictionaries. Everyday dictionaries are obviously not structured in a way that enables their immediate use in NLP systems, but several studies have shown that relatively simple procedures can be used to extract taxonomies and various other semantic relations (for example, Amsler, 1980; Calzolari, 1984; Chodorow, Byrd, and Heidorn, 1985; Markowitz, Ahlswede, and
1988; Véronis and Ide, 1990; Klavans, Chodorow, and
However, it remains to be seen whether information automatically extracted from dictionaries is sufficiently complete and coherent to be actually usable in NLP systems. Although there is concern over the quality of automatically extracted lexical information, very few empirical studies have attempted to assess it systematically, and those that have done so have been restricted to consideration of the quality of grammatical information (e.g., Akkerman, Masereeuw, and Meijs, 1985). No evaluation of automatically extracted semantic information has been published.
The authors would like to thank Lisa Lassck and Anne Gilman for their contribution to this work.
In this paper, we report the results of a quantitative evaluation of automatically extracted semantic data. Our results show that for any one dictionary, 55-70% of the extracted information is garbled in some way. These results at first call into doubt the validity of automatic extraction from dictionaries. However, in section 4 we show that this error rate can be dramatically reduced to about 6% by several means--most significantly, by combining the information extracted from five dictionaries. It therefore appears that even if individual dictionaries are an unreliable source of semantic information, multiple dictionaries can play an important role in building large lexical-semantic databases.
Our strategy involves automatically extracting hypernyms from five English dictionaries for a limited corpus. To determine where problems exist, the resulting hierarchies for each dictionary are compared to an "ideal" hierarchy constructed by hand. The five
We begin with the most straightforward case in order to determine an upper bound for the results. We deal with words within a domain which poses few modelling problems, and we focus on hyperonymy, which is probably the least arguable semantic relation and has been shown to be the easiest to extract. If the results are poor under such favorable constraints, we can foresee that they will be poorer for more complex (abstract) domains and less clear-cut relations.
An ideal hierarchy probably does not exist for the entire dictionary; however, a fair degree of consensus seems possible for carefully chosen terms within a very restricted domain. We have therefore selected a corpus of one hundred kitchen utensil terms, each representing a concrete, individual object--for example, cup, fork, saucepan, decanter, etc. All of the terms are count nouns. Mass nouns, which can cause problems, have been excluded (for example, the mass noun cutlery is not a hypernym of knife). Other idiosyncratic cases, such as chopsticks (where it is not clear if the utensil is one object or a pair of objects) have also been eliminated from the corpus. This makes it easy to apply simple tests for hyperonymy, which, for instance, enable us to say that Y is a hypernym of X if "this is an X" entails but is not entailed by "this is a Y" (Lyons, 1963).
Chodorow, Byrd, and Heidorn (1985) proposed a heuristic for extracting hypernyms which exploits the fact that definitions for nouns typically give a hypernym term as the head of the defining noun phrase. Consider the following examples:
dipper  a ladle used for dipping... [CED]
ladle  a long-handled spoon... [CED]
spoon  a metal, wooden, or plastic utensil... [CED]
In very general terms, the heuristic consists of extracting the word which precedes the first preposition, relative pronoun, or participle encountered in the definition text. When this word is "empty" (e.g. one, any, kind, class) the true hypernym is the head of the noun phrase following the preposition of:
slice  any of various utensils... [CED]
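The heuristic described above can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure of Chodorow, Byrd, and Heidorn: the word lists (PREPOSITIONS, RELATIVE_PRONOUNS, PARTICIPLES, EMPTY_HEADS) are small hand-picked assumptions, participles are recognized by lookup rather than by morphological analysis, and no lemmatization is attempted, so a plural head is returned as it appears in the definition.

```python
import re

# Hypothetical, hand-picked word lists; a real system would need a tagger or parser.
PREPOSITIONS = {"of", "for", "with", "in", "on", "to", "from", "by"}
RELATIVE_PRONOUNS = {"that", "which", "who", "whose"}
PARTICIPLES = {"used", "made", "consisting", "having", "designed"}
EMPTY_HEADS = {"one", "any", "kind", "class"}
BOUNDARIES = PREPOSITIONS | RELATIVE_PRONOUNS | PARTICIPLES


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens, including hyphenated ones."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def head_before_boundary(tokens):
    """Return (head, boundary_index): the word preceding the first boundary word.

    If no boundary word is found, the last token is taken as the head.
    """
    for i, tok in enumerate(tokens):
        if i > 0 and tok in BOUNDARIES:
            return tokens[i - 1], i
    return (tokens[-1] if tokens else None), len(tokens)


def extract_hypernym(definition):
    """Guess the hypernym of a noun from its definition text."""
    tokens = tokenize(definition)
    head, i = head_before_boundary(tokens)
    # "Empty" head followed by "of": look inside the noun phrase after "of".
    if head in EMPTY_HEADS and i < len(tokens) and tokens[i] == "of":
        head, _ = head_before_boundary(tokens[i + 1:])
    return head


if __name__ == "__main__":
    for text in ("a ladle used for dipping...",
                 "a long-handled spoon...",
                 "a metal, wooden, or plastic utensil...",
                 "any of various utensils..."):
        print(extract_hypernym(text))
```

On the four definitions quoted above, the sketch returns ladle, spoon, utensil, and utensils respectively.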
Automatically extracted hierarchies are necessarily tangled (Amsler, 1980) because many words are polysemous. For example, in the CED, the word pan has the following senses (among others):
pan¹ 1.a  a wide metal vessel... [CED]
pan² 1  the leaf of the betel tree... [CED]
The CED also gives pan as the hypernym for saucepan, which taken together yields the hierarchy in figure 1.a. The tangled hierarchy is problematic because, following the path upwards from saucepan, we find that saucepan can be a kind of leaf. This is clearly erroneous. A hierarchy utilizing senses rather than words would not be tangled, as shown in figure 1.b. In our study, the hierarchy was disambiguated by hand. Sense disambiguation in dictionary definitions is a difficult problem, and we will not address it here; this problem is the focus of much current research and is considered in depth elsewhere (e.g., Byrd et al., 1987; Byrd, 1989; Véronis and Ide, 1990; Klavans, Chodorow, and Wacholder, 1990; Wilks et al., 1990).
[Figure 1: "Sense-tangled" hierarchy. a) word hierarchy, in which saucepan is attached to pan and pan to both vessel and leaf; b) sense hierarchy, in which the two senses of pan are kept apart.]
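The difference between the two hierarchies of figure 1 can be made concrete with a short Python sketch. The dictionaries below encode only the pan example from the text; the sense labels pan_1 and pan_2 are hypothetical names introduced for illustration.

```python
def ancestors(term, parents):
    """Collect every node reachable by following hypernym links upwards."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Word hierarchy: the two senses of "pan" collapse into a single node.
word_hierarchy = {"saucepan": ["pan"], "pan": ["vessel", "leaf"]}

# Sense hierarchy: each sense is its own node (labels are illustrative).
sense_hierarchy = {"saucepan": ["pan_1"], "pan_1": ["vessel"], "pan_2": ["leaf"]}

if __name__ == "__main__":
    print("leaf" in ancestors("saucepan", word_hierarchy))   # True: tangled
    print("leaf" in ancestors("saucepan", sense_hierarchy))  # False
```

Following links upwards from saucepan reaches leaf only in the word hierarchy, which is the erroneous inference described above.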
Hierarchies constructed with methods such as those outlined in section 2 show, upon close inspection, several serious problems. In this section, we describe the most pervasive problems and give their frequency in our five dictionaries. The problems fall into two general types: those which arise because information in the dictionary is incomplete, and those which are the result of a lack of distinction among terms and the lack of a one-to-one mapping between terms and concepts, especially at the highest levels of the hierarchy.
3.1. Incomplete information
The information in dictionaries is incomplete for two main reasons. First, since a dictionary is typically the product of several lexicographers' efforts and is constructed, revised, and updated over many years, there exist inconsistencies in the criteria by which the hypernyms given in definition texts are chosen. In addition, space and readability restrictions, on the one hand, and syntactic restrictions on phrasing, on the other, may dictate that certain information is unspecified in definition texts or left to be implied by other parts of the definition.
3.1.1. Attachment too high: 21-34%
The most pervasive problem in automatically extracted hierarchies is the attachment of terms too high in the hierarchy. It occurs in 21-34% of the definitions in our sample from the five dictionaries (figure 8). For example, while pan and bottle are vessels in the CED, cup and bowl are simply containers, the hypernym of vessel. Obviously, "this is a cup" and "this is a bowl" both entail (and are not entailed by) "this is a vessel". Further, other dictionaries give vessel as the hypernym for cup and bowl. Therefore, the attachment of cup and bowl to the higher-level term container seems to be an inconsistency within the CED.
The problem of attachment too high in the hierarchy occurs relatively randomly within a given dictionary. In dictionaries with a controlled definition vocabulary (such as the LDOCE), the problem of attachment at high levels of the hierarchy results also from a lack of terms from which to choose. For example, ladle and dipper are both attached to spoon in the LDOCE, although "this is a dipper" entails and is not entailed by "this is a ladle". There is no way that dipper could be defined as a ladle (as, for instance, in the CED), since ladle is not in the defining vocabulary. As a result, hierarchies extracted from the LDOCE are consistently flat (figure 7).
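One way to picture this category of error is to compare an extracted hierarchy against a reference hierarchy and flag a term whenever its extracted hypernym lies strictly above its reference hypernym. The Python sketch below assumes single-parent hierarchies stored as dictionaries; the `ideal` fragment is a hypothetical hand-built reference consistent with the cup and bowl example, not the ideal hierarchy used in the study.

```python
def chain_above(term, parent_of):
    """Return the list of ancestors of `term`, nearest first."""
    chain, seen = [], {term}
    node = parent_of.get(term)
    while node is not None and node not in seen:
        chain.append(node)
        seen.add(node)
        node = parent_of.get(node)
    return chain


def attached_too_high(term, extracted, ideal):
    """True if the extracted hypernym is a proper ancestor of the ideal one."""
    found, expected = extracted.get(term), ideal.get(term)
    if found is None or expected is None or found == expected:
        return False
    return found in chain_above(expected, ideal)


# Hypothetical fragment of a hand-built reference hierarchy.
ideal = {"cup": "vessel", "bowl": "vessel", "pan": "vessel",
         "bottle": "vessel", "vessel": "container"}

# Hypernyms as the text reports them for the CED.
ced = {"cup": "container", "bowl": "container",
       "pan": "vessel", "bottle": "vessel"}

if __name__ == "__main__":
    too_high = [t for t in ced if attached_too_high(t, ced, ideal)]
    print(too_high)                  # ['cup', 'bowl']
    print(len(too_high) / len(ced))  # 0.5 on this toy sample
```

The proportion printed here describes only this four-word toy sample and is not comparable to the rates reported in figure 8.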
3.1.2. Absent hypernyms: 0-3%
In some cases, strategies like that of Chodorow, Byrd and Heidorn yield incorrect hypernyms, as in the following definitions:
grill  A grill is a part of a cooker... [COBUILD]
corkscrew  a pointed spiral piece of metal... [W9]
dinner service  a complete set of plates and dishes... [LDOCE, not included in our corpus]
The words part, piece, set, are clearly not hypernyms of the defined concepts: it is virtually meaningless to say that grill is a kind of part, or that corkscrew is a kind of piece. In these cases, the head of the noun phrase serves to mark another relation: part-whole, member-class, etc. It is easy to reject these and similar words (member, series, etc.) as hypernyms, since they form a closed list (Klavans, Chodorow, and Wacholder, 1990). However, excluding these words leaves us with no hypernym. We call these "absent hypernyms"; they occur in 0-3% of the definitions in our sample corpus (figure 8).
The absence of a hypernym in a given definition text does not necessarily imply that no hypernym exists. For example, "this is a corkscrew" clearly entails (and is not entailed by) "this is a device" (the hypernym given by the COBUILD and the CED). In many cases, the lack of a hypernym seems to be the result of concern over space and/or readability. We can imagine, for example, that the definition for corkscrew could be more fully specified as "a device consisting of a pointed spiral piece of metal..." In such cases, lexicographers rely on the reader's ability to deduce that something made of metal, with a handle, used for pulling corks, can be called a device. However, for some terms, such as cutlery or dinner service, it is not clear that a hypernym exists. Note that we have voluntarily excluded problematic terms of this kind from our corpus, in order to restrict our evaluation to the best case.
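Rejecting heads from a closed list is straightforward to sketch. In the Python fragment below, RELATION_MARKERS contains only the five words named in the text (the real closed list is presumably longer), and the `heads` sample is a toy mix of the examples quoted above, so the resulting proportion is illustrative only.

```python
# Heads that mark part-whole, member-class, etc. rather than hyperonymy.
# Only the words mentioned in the text; the full closed list is an open assumption.
RELATION_MARKERS = {"part", "piece", "set", "member", "series"}


def filter_head(head):
    """Return the head if it can serve as a hypernym, else None (absent)."""
    return None if head in RELATION_MARKERS else head


def absent_rate(heads):
    """Proportion of definitions left without a hypernym after filtering."""
    filtered = [filter_head(h) for h in heads.values()]
    return sum(h is None for h in filtered) / len(filtered)


# Toy sample: heads as a head-finding heuristic might return them.
heads = {"grill": "part", "corkscrew": "piece",
         "dipper": "ladle", "ladle": "spoon"}

if __name__ == "__main__":
    print({w: filter_head(h) for w, h in heads.items()})
    print(absent_rate(heads))  # 0.5 on this toy sample
```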
3.1.3. Missing overlaps: 8-14%
Another problem results from the necessary choices that lexicographers must make in an attempt to specify a
containers. Although this representation is more intuitively accurate than the representation in figure 5.b, ultimately it goes
themselves do not agree, and when taken formally they yield very different diagrams for higher level concepts.
[Figure 6. Solving "loops"]
Figure 8 shows that 7-11% of the definitions use a hypernym that is itself defined circularly.
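Circularly defined hypernyms can be detected mechanically by following hypernym links until a term is reached again. The Python sketch below assumes one hypernym per term; the chain tool, implement, utensil in `hypernym_of` is an invented toy loop used purely for illustration and is not taken from any of the five dictionaries.

```python
def is_circular(term, hypernym_of):
    """True if following hypernym links from `term` leads back to `term`."""
    seen = set()
    node = hypernym_of.get(term)
    while node is not None and node not in seen:
        if node == term:
            return True
        seen.add(node)
        node = hypernym_of.get(node)
    return False


def uses_circular_hypernym(term, hypernym_of):
    """True if the hypernym given for `term` is itself defined circularly."""
    parent = hypernym_of.get(term)
    return parent is not None and is_circular(parent, hypernym_of)


# Invented toy data: a three-term loop plus terms attached to it.
hypernym_of = {"tool": "implement", "implement": "utensil", "utensil": "tool",
               "fork": "tool", "spoon": "utensil", "ladle": "spoon"}

if __name__ == "__main__":
    flagged = sorted(t for t in hypernym_of
                     if uses_circular_hypernym(t, hypernym_of))
    print(flagged)
```

Here ladle is the only term not flagged, since its hypernym spoon is not part of a loop.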
[Figure 7. Hierarchies for the CED and LDOCE]
3.3. Summary
Altogether, the problems described in the sections above yield a 55-70% error rate in automatically extracted hierarchies. Given that we have attempted to consider the most favorable case, it appears that any single dictionary, taken in isolation, is a poor source of automatically extracted semantic information. This is made more evident in figure 7, which demonstrates the marked differences in hierarchies extracted from the
summary of our results appears in figure 8.
[Figure 8. Quantitative evaluation (columns: COLLINS, LDOCE, OALD, W9, COMBINED)]
We have concluded that hierarchies extracted using strategies such as that of Chodorow, Byrd, and Heidorn are seriously flawed, and are therefore likely to be unusable in NLP systems. However, in this section we discuss various means to refine automatically extracted hierarchies, most of which can be performed automatically.
[Figure 9. Mer... Table of the hypernyms extracted from each dictionary (columns: WORD, COBUILD, COLLINS, LDOCE, OALD, W...). Surviving cells, in row order: ladle: spoon, spoon, spoon; basin: container, container, container; ewer: jug, jug OR pitcher, container; saucepan: pot, pan, pot; grill: (absent), device, (absent); fork: tool, implement, instrument.]
4.1. Merging dictionaries
It is possible to use information provided in the
example, in the definition
vessel  any object USED AS a container... [CED]
However, some additional processing of the definition
phrase "used as". It is also possible to use other
cutlery  implements used for eating SUCH AS knives, forks, and spoons. [CED]
demands some extra parsing, which may be difficult for complex definitions. Also, further research is required to determine which phrases function as markers for which kind of information, and to determine how consistent their use is. More importantly, such information is sporadic, and its extraction may require more effort than the results warrant. We therefore seek more "brute force" methods to improve automatically extracted hierarchies. One of the most promising strategies for refining extracted information is the use of information from several dictionaries. Hierarchies derived from individual dictionaries suffer from incompleteness, but it is extremely unlikely that the same information is consistently missing from all dictionaries. For instance,
It is therefore possible to use taxonomic information from several dictionaries to fill in absent hypernyms and missing links, and to rectify cases of too high attachment. To investigate this possibility, we merged the information extracted from the five English dictionaries in our database. The individual data for the five dictionaries were organized in a table, as in figure 9. Merging these hierarchies into a single hierarchy was accomplished automatically by applying a simple algorithm, which scans the table line-by-line, as follows:
do not reliably provide a hypernym.
the hypernym. Otherwise: a) if a term is a hypernym of another term in the line, ignore it. b) take the remaining cell or cells as the hypernym(s).
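Because only part of the algorithm's description survives here, the following Python sketch should be read as one plausible reading of it rather than as the procedure actually used. It assumes that absent cells are skipped, that a line whose remaining cells all agree takes that term as the hypernym, and that otherwise steps a) and b) apply. The table is a hypothetical fragment in the spirit of figure 9 (the jug row is invented so that the filtering in step a) has something to work with), and `max_passes` is an added safety bound.

```python
def is_ancestor(candidate, term, merged):
    """True if `candidate` is reachable upwards from `term` in `merged`."""
    seen, stack = set(), [term]
    while stack:
        node = stack.pop()
        for parent in merged.get(node, ()):
            if parent == candidate:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False


def merge_line(cells, merged):
    """Merge one table line into a set of hypernyms."""
    candidates = {c for c in cells if c is not None}  # skip absent cells
    if len(candidates) <= 1:                          # all dictionaries agree
        return candidates
    # a) ignore a term that is a hypernym of another term in the line;
    # b) keep the remaining cell or cells.
    return {c for c in candidates
            if not any(is_ancestor(c, other, merged)
                       for other in candidates if other != c)}


def merge_table(table, max_passes=10):
    """Repeat passes over the table until no line changes."""
    merged = {}
    for _ in range(max_passes):
        changed = False
        for word, cells in table.items():
            result = merge_line(cells, merged)
            if merged.get(word) != result:
                merged[word] = result
                changed = True
        if not changed:
            break
    return merged


# Hypothetical table fragment; None stands for an absent hypernym.
table = {"ewer": ["jug", "container"],
         "jug": ["container", "container"],
         "grill": [None, "device", None]}

if __name__ == "__main__":
    print(merge_table(table))
```

On the first pass ewer keeps both jug and container, because nothing is yet known about jug; on the second pass container is dropped once jug has been attached to it, which is why several passes are needed.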
This algorithm must be applied recursively, since, for example, it may not yet be known when evaluating
processed. Therefore, several passes through the table are required. Note that if after applying the algorithm several terms are left as hypernyms for a given word, we effectively create an overlap in the hierarchy. For
We evaluate the quality of the resulting combined hierarchy using the same strategy applied in section 3. It is interesting to note that in the merged hierarchy, all the absent hypernym problems (including absence due to or-heads) have been eliminated, since in every case at least one of the five dictionaries gives a valid hypernym. In addition, almost all of the attachments too high in the hierarchy and missing overlaps have disappeared, although a few cases remain (5% and 1%, respectively). None of the dictionaries, for instance,
the elimination of many of these remaining
Merging dictionaries on a large scale assumes that it is possible to automatically map senses across them. For our small sample, we mapped senses among dictionaries by hand. We describe elsewhere a promising method to automatically accomplish sense mapping, using a spreading activation algorithm (Ide and Véronis, 1990).
4.2. Covert categories
There remain a number of circularly-defined hypernyms in the combined taxonomy, which demand additional consideration on theoretical grounds. Circularly-defined terms tend to appear when lexicographers lack terms to designate certain concepts. The fact that "it is not impossible for what is intuitively recognized as a conceptual category to be without a label" has already been noted (Cruse, 1986, p. 147). The lack of a specific term for a recognizable concept tends to occur more frequently at the higher levels of the hierarchy (and at the very lowest and most specific levels as well--e.g., there is no term to designate forks with two prongs). This is probably because any language
1958), that is, the level of everyday, ordinary terms for
Circularity, as well as the use of or-conjoined terms at the high levels of the hierarchy, results largely from the lexicographers' efforts to approximate the terms they lack. For example, there is no clear term to denote that category of objects which fall under any of the terms
concept seems to exist. Clearly, these terms are not
term, let us say X, for the concept existed, then the