AN ASSESSMENT OF SEMANTIC INFORMATION AUTOMATICALLY

EXTRACTED FROM MACHINE READABLE DICTIONARIES

Jean Véronis†,‡ and Nancy Ide†

†Department of Computer Science

VASSAR COLLEGE

Poughkeepsie, New York 12601 (U.S.A.)

‡Groupe Représentation et Traitement des Connaissances

CENTRE NATIONAL DE LA RECHERCHE SCIENTIFIQUE

31, Ch. Joseph Aiguier

13402 Marseille Cedex 09 (France)

ABSTRACT

In this paper we provide a quantitative evaluation of

information automatically extracted from machine

readable dictionaries. Our results show that for any one

dictionary, 55-70% of the extracted information is

garbled in some way. However, we show that this

error rate can be dramatically reduced to about 6% by

combining the information extracted from five

dictionaries. It therefore appears that even if individual

dictionaries are an unreliable source of semantic

information, multiple dictionaries can play an important

role in building large lexical-semantic databases.

1. INTRODUCTION

In recent years, it has become increasingly clear that the

limited size of existing computational lexicons and the

poverty of the semantic information they contain

represent one of the primary bottlenecks in the

development of realistic natural language processing

(NLP) systems. The need for extensive lexical and

semantic databases is evident in the recent initiation of a

number of projects to construct massive generic

lexicons for NLP (project GENELEX in Europe or

EDR in Japan).

The manual construction of large lexical-semantic

databases demands enormous human resources, and

there is a growing body of research into the possibility

of automatically extracting at least a part of the required

lexical and semantic information from everyday

dictionaries. Everyday dictionaries are obviously not

structured in a way that enables their immediate use in

NLP systems, but several studies have shown that

relatively simple procedures can be used to extract

taxonomies and various other semantic relations (for

example, Amsler, 1980; Calzolari, 1984; Chodorow,

Byrd, and Heidorn, 1985; Markowitz, Ahlswede, and

Evens, 1986; Byrd et al., 1987; Nakamura and Nagao,

1988; Véronis and Ide, 1990; Klavans, Chodorow, and

Wacholder, 1990; Wilks et al., 1990).

However, it remains to be seen whether information

automatically extracted from dictionaries is sufficiently

complete and coherent to be actually usable in NLP

systems. Although there is concern over the quality of

automatically extracted lexical information, very few

empirical studies have attempted to assess it

systematically, and those that have done so have been

restricted to consideration of the quality of grammatical

information (e.g., Akkerman, Masereeuw, and Meijs,

1985). No evaluation of automatically extracted

semantic information has been published.

The authors would like to thank Lisa Lassek and Anne Gilman for

their contribution to this work.

In this paper, we report the results of a quantitative

evaluation of automatically extracted semantic data. Our

results show that for any one dictionary, 55-70% of the

extracted information is garbled in some way. These

results at first call into doubt the validity of automatic

extraction from dictionaries. However, in section 4 we

show that this error rate can be dramatically reduced to

about 6% by several means--most significantly, by

combining the information extracted from five

dictionaries. It therefore appears that even if individual

dictionaries are an unreliable source of semantic

information, multiple dictionaries can play an important

role in building large lexical-semantic databases.
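
One way to picture how several dictionaries can compensate for one another is to pool the hypernyms each of them proposes for a headword. The sketch below is only an illustration under stated assumptions, not the procedure developed in section 4. The CED and W9 entries for cup are the attachments reported in section 4; the DICT_X and DICT_Y entries and the function name pool_hypernyms are hypothetical.

```python
from collections import Counter

# Hypernym proposed for "cup" by each source.
# CED -> container and W9 -> vessel are reported in the paper;
# DICT_X and DICT_Y are hypothetical placeholders.
PROPOSED_FOR_CUP = {
    "CED": "container",
    "W9": "vessel",
    "DICT_X": "vessel",  # hypothetical
    "DICT_Y": None,      # hypothetical: nothing could be extracted
}


def pool_hypernyms(proposals):
    """Count how many sources propose each hypernym, ignoring empty cells."""
    votes = Counter(h for h in proposals.values() if h is not None)
    # most_common() orders candidates from most to least supported.
    return votes.most_common()


ranked = pool_hypernyms(PROPOSED_FOR_CUP)
```

A source that yields nothing for a term simply contributes no vote, so a gap in one dictionary can be filled by the others.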

2. METHODOLOGY

Our strategy involves automatically extracting

hypernyms from five English dictionaries for a limited

corpus. To determine where problems exist, the

resulting hierarchies for each dictionary are compared to

an "ideal" hierarchy constructed by hand. The five

dictionaries compared were: the Collins English

Dictionary (CED), the Oxford Advanced Learner's

Dictionary (OALD), the COBUILD Dictionary, the

Longman's Dictionary of Contemporary English

(LDOCE) and the Webster's 9th Dictionary (W9).
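
The comparison against a hand-built hierarchy can be sketched as follows. This is a minimal illustration that assumes each hierarchy is stored as a mapping from a term to its immediate hypernym. The terms, the dictionary names DICT_A and DICT_B, and the function disagreement_rate are hypothetical, and the measure is cruder than the problem categories assessed in the evaluation.

```python
# Hypothetical hand-built hierarchy: term -> immediate hypernym.
IDEAL = {"cup": "vessel", "fork": "implement", "ladle": "spoon"}

# Hypothetical extraction results for two dictionaries.
EXTRACTED = {
    "DICT_A": {"cup": "container", "fork": "implement", "ladle": "spoon"},
    "DICT_B": {"cup": "vessel", "fork": "tool"},  # nothing found for "ladle"
}


def disagreement_rate(extracted, ideal):
    """Fraction of ideal terms whose extracted hypernym is missing or different."""
    wrong = sum(1 for term, hyper in ideal.items() if extracted.get(term) != hyper)
    return wrong / len(ideal)


rates = {name: disagreement_rate(h, IDEAL) for name, h in EXTRACTED.items()}
```

Treating a missing hypernym as a disagreement is a design choice of this sketch; a finer evaluation would separate missing, overly general, and plainly wrong attachments.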

We begin with the most straightforward case in order to

determine an upper bound for the results. We deal with

words within a domain which poses few modelling

problems, and we focus on hyperonymy, which is

probably the least arguable semantic relation and has

been shown to be the easiest to extract. If the results are

poor under such favorable constraints, we can foresee

that they will be poorer for more complex (abstract)

domains and less clearly cut relations.

An ideal hierarchy probably does not exist for the entire

dictionary; however, a fair degree of consensus seems

possible for carefully chosen terms within a very

restricted domain. We have therefore selected a corpus

of one hundred kitchen utensil terms, each representing

a concrete, individual object--for example, cup, fork,

saucepan, decanter, etc. All of the terms are count

nouns. Mass nouns, which can cause problems, have

been excluded (for example, the mass noun cutlery is

not a hypernym of knife). Other idiosyncratic cases,

such as chopsticks (where it is not clear if the utensil is

one object or a pair of objects) have also been

eliminated from the corpus. This makes it easy to apply

simple tests for hyperonymy, which, for instance,

enable us to say that Y is a hypernym of X if "this is an

X" entails but is not entailed by "this is a Y" (Lyons,

1963).
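
The entailment test can be read as strict set inclusion: every X is a Y, but not every Y is an X. The sketch below makes that reading concrete. It assumes each term comes with the set of objects it can describe; these sets and the function name is_hypernym are hypothetical and serve only to illustrate the test.

```python
# Hypothetical extensions: the objects each term can be used to describe.
EXTENSION = {
    "cup": {"teacup", "coffee cup"},
    "bowl": {"soup bowl"},
    "vessel": {"teacup", "coffee cup", "soup bowl", "jug"},
}


def is_hypernym(y, x, ext=EXTENSION):
    """True if "this is an x" entails "this is a y" but not the reverse.

    With extensions as sets, this holds exactly when ext[x] is a proper
    subset of ext[y].
    """
    return ext[x] < ext[y]


cup_under_vessel = is_hypernym("vessel", "cup")
vessel_under_cup = is_hypernym("cup", "vessel")
```

Under these assumed sets, vessel passes the test as a hypernym of cup, while the reverse direction and the pair bowl and cup both fail.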

Chodorow, Byrd, and Heidorn (1985) proposed a

heuristic for extracting hypernyms which exploits the

fact that definitions for nouns typically give a hypernym

3. EVALUATION

where spatula should appear (since we have no

indication that it is not a container), but at least it shows

that there may be some utensils which are not

CED and LDOCE for a small subset of our corpus. A

4. REFINING

differentiae of definition texts to refine hierarchies; for

the automatically extracted hypernym is object.

text enables the extraction of container following the

definitions. For example, the CED does not specify that

knife and spoon are implements, but this information is

provided in the definition of cutlery:

The extraction of information from differentiae

the CED attaches cup to container, which is too high in

the hierarchy, while the W9 attaches it lower, to vessel.

by or as null, since, as we saw in section 3.2.1, they

2) if all the cells agree (as for ladle), keep that term as

bct~in that container is a hypernym of vessel, and vessel

is a hypernym of bowl, until those terms are themselves

example, saucepan is attached to both pot and pan, and

fork is attached to tool, implement, and instrument.

gives pot as the hypernym of teapot, although three of

the five dictionaries give pot as the hypernym of

coffeepot. A larger dictionary database would enable

imperfections (for example, New Penguin English

Dictionary, not included in our database, gives pot as a

hypernym of teapot).

includes the most terms at the generic level (Brown,

objects and living things (dog, pencil, house, etc.).

utensil, tool, implement, instrument, although this

strictly synonymous--there are, for example, utensils

that one would not call tools (e.g., a colander). If a

definitions for utensil, tool, implement, and instrument