|
|
|
Jorge BAPTISTA |
|
Universidade do Algarve – FCHS |
|
Laboratório de Engenharia da Linguagem – CAUTL –
IST |
|
Campus de Gambelas, Faro, Portugal, P-8005-139 |
|
jbaptis@ualg.pt |
|
|
|
|
|
|
For many NLP applications, fully tagged texts
are required. |
|
Even if statistical methods may be used to tag
texts, electronic dictionaries are essential tools for high quality tagging
of large-sized texts. |
|
Large electronic dictionaries of both simple and
multiword lexical units have been built for European Portuguese. |
|
|
|
|
In spite of their size, a non-trivial number of
tokens of large-sized corpora remain untagged. |
|
Suppletive, morphological parsing rules can be
used to cope with many lacunae, especially with regularly derived words. |
|
However, there are empirical limits to
morphological parsers, so that other methods for automatic lexical analysis
must also be envisaged. |
|
|
|
|
Automatic lexical analysis of texts can be
carried out using different methods (see Ranchhod 2001 for an overview). |
|
Most systems, however, even if they use
statistical methods predominantly, also use to some degree an electronic
dictionary, where lexical information, idiosyncratic by nature, is stored. |
|
|
|
|
Large electronic dictionaries of both simple and
compound words have been built for several languages, including Portuguese
(Eleutério et al. 1995, Ranchhod et al. 1999). |
|
In spite of their size, when these lexical
resources are applied to a large corpus, a non-trivial number of tokens remain
to be tagged. |
|
Since the lexicon is an evolving object, one cannot
expect dictionaries to be so comprehensive and exhaustive that they
contain all possible words. This is particularly true of regularly
derived words (-ly adverbs, -ize verbs, for example). |
|
|
|
|
Morphological parsers have been built, which can
be used with or without dictionaries. |
|
With such tools it is possible to fill the
dictionary’s lacunae, that is, to formalize morphological
rules so that the system may recognize (and tag) words that have not been
previously included in the dictionaries. |
|
These rules may be used in a suppletive way (or
in connection with the dictionary), and results from their application can
then be manually checked by linguists and used to extend the coverage of
the initial dictionary. |
|
|
|
|
In this paper, an attempt was made to estimate
how many of the unknown, hence untagged, tokens of a large corpus
can be adequately recognized by a morphological parser
using only regular derivational rules, and
to evaluate the precision and determine empirically the limitations of
this methodology. |
|
A set of morphological rules was built, focusing
on a list of unknown tokens. |
|
Results from this morphological module are described
here and its precision is evaluated. |
|
|
|
|
|
|
|
|
|
|
Many forms in ERR are perfectly ‘normal’ words
that were simply missing from the dictionary. |
|
With the help of an inverse list of ERR (also
obtained with INTEX dictionary tools), it was possible to determine some of
the most productive derivational rules at stake: |
|
|
|
|
|
|
nouns formed with suffix –ismo (-ism): vegetarianismo
(vegetarianism), from vegetariano (vegetarian); |
|
nouns and adjectives formed with suffixes ‑logia
(-logy), -ólogo (-logue), -ologista (-logist), and ‑lógico (-logic),
designating scientific/technological domains, the
professionals in those domains, and the relational adjective associated with
them: |
|
paleontologia (paleontology), |
|
paleontólogo or paleontologista (paleontologist),
paleontológico (‘paleontologic’, related to paleontology). |
|
|
|
|
nouns and adjectives formed with suffixes ‑mancia
and -mância (-mancy), -mante (-mant), and –mântico (-mantic), designating
divinatory arts, their practitioners, and the
relational adjective associated with them: |
|
quiromancia / quiromância (chiromancy,
palmistry), quiromante (‘chiromant’, palmist, psychic who reads palms to
divine the future),
quiromântico (‘chiromantic’, related to
chiromancy, palmistry). |
|
|
|
|
Besides these, many derived words were found
formed with prefixes (Pfx); for these, a list of the 170 most common
prefixes was established, based on the lists available in grammars and on new
prefixes found in the text, e.g.: |
|
anti-, auto‑, bi-, contra-, des-, equi-, etno-, extra-,
farmaco-, foto- (photo-), geo-, hepta-, hidro-, hipo-, homo-, in- (and
variants: i-, im-, ir‑), inter-, macro-, mega-, micro-, mono-, neo-, opto-,
pluri-, proto-, pseudo-, psico- (psych-), radio-, re-, retro-, semi-, socio-,
super-, tele-, tetra-, trans-, tri-, ultra-, uni-, video-, xeno-, zoo-,
etc. |
|
|
|
|
Obviously, many words can be polysynthetic, i.e.
formed by simultaneous prefixation and suffixation: |
|
|
|
descontroladamente |
|
< des- Pfx+ controlar V+
-ada Sfx-a + -mente Sfx-adv> |
|
(uncontrolledly). |
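
The decomposition above can be sketched in a few lines of Python. This is only an illustration of the Pfx + V + -ada + -mente pattern, not the Intex transducer itself; the toy verb list, the prefix sample and the function name `analyze` are assumptions introduced here.

```python
# Hypothetical toy resources: a tiny verb lexicon and a sample of the prefix list.
VERBS = {"controlar"}
PREFIXES = ("des", "re", "anti")

def analyze(token):
    """Try the pattern Pfx + V + -ada + -mente; return the analysis or None."""
    if not token.endswith("adamente"):
        return None
    stem = token[: -len("adamente")]           # e.g. "descontrol"
    for pfx in PREFIXES:
        if stem.startswith(pfx):
            # Restore the infinitive of a first-conjugation (-ar) verb.
            verb = stem[len(pfx):] + "ar"
            if verb in VERBS:                  # the base verb must be attested
                return f"{pfx}- Pfx + {verb} V + -ada Sfx-a + -mente Sfx-adv"
    return None

print(analyze("descontroladamente"))
```

The analysis is accepted only if the stripped base is found in the lexicon, which is what keeps arbitrary strings ending in -adamente from being tagged.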
|
|
|
After a certain point, derivation rules have a
very low productivity, i.e., the number of words regularly formed becomes
negligible. |
|
|
|
|
A set of morphological rules was built with the
new morphological parser of Intex
(version
4.33, February 24, 2004; Silberztein 2004:130‑142). |
|
These rules are enhanced finite-state
transducers (FST). |
|
|
|
|
Example: |
|
Faced with the new, unknown word form umbilicalmente,
formed from the adjective umbilical (idem, related to the navel), the
system checks whether there is an adjective umbilical in the lexicon (in fact,
there is), and if so it produces the lexical entry: |
|
|
|
umbilicalmente,umbilicalmente.ADV+A=umbilical+Sfx=mente |
|
|
|
The context of this new word in the corpus
is: “a imagem de um PS umbilicalmente ligado ao modelo jurídico-penal” (the
image of a Socialist Party umbilically connected to the juridical-penal
model). From this context, the meaning of this adverb should be something
like ‘closely, intimately, or inextricably’. |
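
A minimal Python sketch of this suppletive -mente rule follows. It stands in for the finite-state transducer and is not the Intex implementation; the toy lexicon and the function name `analyze_mente` are assumptions made for the example.

```python
# Hypothetical toy lexicon: word form -> part-of-speech code.
LEXICON = {"umbilical": "A", "casa": "N"}

def analyze_mente(token, lexicon=LEXICON):
    """Return a DLF-style entry for an unknown -mente adverb, or None."""
    if token in lexicon:             # suppletive use: known words are left alone
        return None
    if not token.endswith("mente"):
        return None
    base = token[: -len("mente")]
    if lexicon.get(base) != "A":     # the base must be an attested adjective
        return None
    return f"{token},{token}.ADV+A={base}+Sfx=mente"

print(analyze_mente("umbilicalmente"))
# umbilicalmente,umbilicalmente.ADV+A=umbilical+Sfx=mente
```

A form such as semente ('seed') is not analysed, because the remaining string se is not an adjective in the lexicon.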
|
|
|
|
Remark: |
|
We have built a small module of rules to deal
with the derivation of diminutive, augmentative and superlative forms. |
|
The number of rules built by us is given here
merely as an indication. |
|
C. Mota (2003) has built a larger and more
complex module of rules for the same derivational processes. We did not use
her work here; therefore, the results ignore this module. |
|
|
|
|
The number of prefixes used (approximately 170)
influences significantly the number of rules in each family. |
|
As we will see below, some of these prefixes
give rise to significant erroneous analyses, so it is possible that in a
future version some of them will be removed, and only used in more
constrained rules. |
|
As this is ongoing research, the number of
rule-families will surely increase. |
|
|
|
|
The morphological module that integrates all
these rules is a 20 Kb FST with 595 states and 990 transitions. |
|
It takes 35 seconds to analyze the 31,337 forms of the ERR
list of the training corpus and to produce the 3,533 entries of the
resulting DLF. |
|
|
|
|
First, an evaluation was made of the application
of the set of FSTs to the training corpus. |
|
We first present the lexical coverage of
the morphological module and then assess its success rate. |
|
|
|
|
Table 3: Lexical coverage of morphological
rules: results from training corpus. |
|
WF = word forms; DLF=simple word entries; n‑tuples=different
entries produced for the same word‑forms; %DLF=n‑tuples’
percentage of DLF entries; %ERR=percentage of ERR list. |
|
|
|
|
Table 4: Results from training corpus |
|
|
|
|
As we can see, the global success rate is high
(approx. 93%). The most important cause of error is initial
strings incorrectly analysed as prefixes. In addition, some adverbs ending in –mente are
analysed even though they contain an (incorrectly spelled)
accented vowel: |
|
diáriamente,diáriamente.ADV+ADJ=diária+SFX=mente |
|
(the correct form, diariamente, ‘daily’,
does not have any accent) |
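
One possible way to block such analyses is sketched below, under the assumption that -mente adverbs never keep the acute or circumflex accent of their base (the tilde is kept, as in cristãmente). The toy adjective list and the function names are hypothetical; this is not part of the rules described here.

```python
import unicodedata

# Hypothetical toy lexicon of adjectives (feminine forms).
ADJECTIVES = {"diária", "rápida", "cristã"}

def strip_stress_accents(word):
    """Remove acute and circumflex accents; keep tilde and cedilla."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in "\u0301\u0302")
    return unicodedata.normalize("NFC", kept)

def analyze_mente_strict(token):
    """Analyse a -mente adverb, rejecting candidates whose base keeps an accent."""
    token = unicodedata.normalize("NFC", token)
    if not token.endswith("mente"):
        return None
    base = token[: -len("mente")]
    if base != strip_stress_accents(base):
        return None                  # accented base: treated as a misspelling
    for adj in ADJECTIVES:
        if strip_stress_accents(adj) == base:
            return f"{token},{token}.ADV+ADJ={adj}+SFX=mente"
    return None

print(analyze_mente_strict("diáriamente"))   # rejected
print(analyze_mente_strict("diariamente"))   # analysed from diária
```

The rejected form could then be handed to a spelling-correction step rather than being entered into the DLF.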
|
|
|
|
Table 5: Lexical Analysis of Test Corpus. |
|
WF = word forms (in millions, M); DWF =
different word forms ; DLF=simple word entries; ERR-0=unknown word-forms; NProp
= candidates to the status of proper names; |
|
ERR-1=remaining
unknown word-forms : ERR list to be tested. |
|
|
|
|
The two fragments do not have exactly the same
size: the testing corpus is 6.2 Mb larger and
has 1.3 million words (1,175 different word
forms) more than the training corpus. |
|
The DLF is also 1,180 entries larger. |
|
However, the number of unknown word forms
(ERR-0) and of proper names (NProp) is almost the same. |
|
The remaining ERR list (ERR-1) after the proper
names have been discarded is also of comparable size. |
|
|
|
|
Table 6: Lexical coverage of morphological
rules: results from testing corpus. |
|
WF = word forms; DLF=simple word entries; n‑tuples=different
entries produced for the same word-forms; %DLF=n‑tuples’ percentage
of DLF entries; %ERR=percentage of ERR list. |
|
|
|
|
Table 7: Results from testing corpus. |
|
|
|
|
|
|
Table 8: Comparison of results from training and
testing corpora. |
|
WF = word forms; DLF=simple word entries; n‑tuples=different
entries produced for the same word-forms; %DLF=n‑tuples’ percentage
of DLF entries; %ERR=percentage of ERR list. |
|
|
|
|
In spite of the different sizes of the two corpora,
the number of word forms in each ERR list is almost the same. |
|
The results from the morphological rules are
also equivalent: both the number of different word forms recognized by the
FSTs and the number of entries of the two DLF are close. |
|
The combined DLF obtained by the application of
the morphological rules to both corpora contains 6,058 different entries,
corresponding to 4,253 different word forms (including diminutives,
augmentatives and superlatives). |
|
|
|
|
There is a slightly greater number of n‑tuples
in the testing corpus, but the percentage of DLF is practically the same. |
|
The first major difference is the lexical
coverage of the morphological module (%ERR), i.e. the percentage of matched
word forms of each ERR list: |
|
while this was about 16% in the training corpus,
it drops to a little less than 9% in the testing corpus. |
|
Secondly, the success rate diminishes significantly,
from 92.68% to 77.93%. |
|
|
|
|
|
|
Even if the morphological rules may constitute
an effective tool to analyze candidate words that can then be manually
checked by linguists in order to extend the lexical coverage of electronic
dictionaries, it is clear that a very substantial part of the text’s
different words remains to be tagged: |
|
28,841 in the training corpus and 28,808 in the
testing corpus |
|
(if all uppercase words were ignored, these could
be reduced to about half: 13,469 in the training corpus and 13,641
in the testing corpus). |
|
|
|
|
It is possible that new morphological rules yet
to be built may help to increase the lexical coverage of unknown words. |
|
But as we include new rules, these apply to an
increasingly small number of word forms. |
|
We
estimate that the number of Portuguese, correctly formed but unknown words
analyzable by the method of suppletive morphologic rules could still be
increased up to 20%. |
|
Furthermore, as new rules interact with
previously made rules, the number of words with multiple analyses
increases, thus diminishing the precision of the results. |
|
|
|
|
|
What is the nature of remaining unknown words? |
|
The most common cases found were: |
|
spelling errors: many unknown words are just due
to typing or spelling errors: |
|
abanadonaria (abandonaria, ‘would abandon’), |
|
abastenção (abstenção, ‘abstention’), |
|
abatecimento (abastecimento, ‘supplying’), etc. |
|
Some errors are due to conversion between
character sets: |
|
bonificaÁões (bonificações, ‘bonifications’) |
|
consÛrcios (consórcios, ‘consortia’) |
|
|
|
|
words derived from proper names (mainly
adjectives): |
|
aladinescos (from Aladin), balzaquiano (Balzaquian),
hartleyano (from Hartley) |
|
hitchcockiana, hitchcoquiana (from
Hitchcock, notice the orthographic adaptation to Portuguese spelling rules:
– ck > -qu - ) |
|
deskhomeinização
(des- Pfx + Khomeini Nprop + -iz Sfx-v + -ação Sfx-n,
from Khomeini, Nprop) |
|
|
|
|
foreign words: in real texts, and in particular
in journalistic texts, there are many foreign words. Mostly, these come
from English and French, but other languages can also be found. |
|
|
|
|
From the examples above, it is clear that no real
corpus is free from many of these problems. |
|
For robust (non-statistical) lexical analysis,
several strategies must be used in combination with dictionaries and a
suppletive morphologic analyzer: |
|
(1) error detection and correction, comparing
unknown forms and lexicalized forms by letter changing, permutation, and so
on; |
|
(2) development of morphologic rules based on dictionaries
of proper names; |
|
(3) language identification procedures, enabling the
system to work with texts with mixed languages. |
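
The error detection and correction strategy can be illustrated with a short Python sketch that generates every string one letter operation away from an unknown form (deletion, transposition, substitution, insertion) and keeps those found in the lexicon. The three-word lexicon, the alphabet string and the function names are assumptions made for the example; a real system would use the full dictionary.

```python
# Hypothetical toy lexicon of known (lexicalized) forms.
LEXICON = {"abandonaria", "abstenção", "abastecimento"}
ALPHABET = "abcdefghijklmnopqrstuvwxyzáàâãçéêíóôõú"

def edits1(word):
    """All strings one deletion, transposition, substitution or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | swaps | replaces | inserts

def suggest(word, lexicon=LEXICON):
    """Return the lexicalized forms reachable by a single letter operation."""
    return sorted(edits1(word) & lexicon)

print(suggest("abanadonaria"))   # ['abandonaria']
print(suggest("abatecimento"))   # ['abastecimento']
```

With a full dictionary, several candidates may be returned for the same unknown form, so the output is better seen as a list of suggestions to be checked than as an automatic correction.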
|
|
|
|
While strategies (1) and (3) have already been
put in place independently in spell checkers in text editors
(MS-Word, for instance) and web browsers, strategy (2) has not seen much
effort from (Portuguese) lexicographers, especially with a view to automatic
lexical analysis. |
|
It combines encyclopedic dictionaries with
morphologic analyzers, an approach similar to the one shown here. However,
to our knowledge, the combination of these strategies (and possibly others)
in the same system has not been done yet. |
|
|
|
|
From the results obtained so far, the precision of
morphological rules is high (90% average), |
|
but it is clear that the goal of zero unknown tokens
is still far from being achieved: |
|
less than 20% of ERR was matched by means
of suppletive morphologic rules. |
|
In real life, there is no such thing as a
‘clean’ corpus: typos, foreign words, and derivatives of proper names are the
sets of unknown tokens most responsible for this insufficiency in automatic
lexical analysis. |
|
|
|
|
For robust lexical analysis of these forms,
other strategies must be found. |
|
These may involve not only language
identification procedures (and use of the corresponding dictionaries) but
also correction of deviating or erroneous forms. |
|
The combination of different strategies in a
single system may constitute both a linguistic and a computational
challenge in the near future. |
|
|
|
|
Acknowledgements |
|
Research for this paper was partially funded
by Fundação para a Ciência e a Tecnologia (project grant
POSI/PLP/34729/99). Thanks are due to C. Mota for making available her
DimAum module. |
|
|
|
References |
|
Eleutério, S.; Ranchhod, E.; Freire, H.;
Baptista, J. 1995. A system of electronic dictionaries of Portuguese. Lingvisticae
Investigationes 17-2: 57-82. Amsterdam: John Benjamins B. V. |
|
Ranchhod, E.; Mota, C.; Baptista, J. 1999. A
Computational Lexicon of Portuguese for Automatic Text Parsing. SIGLEX’99:
Standardizing Lexical Resources. Proceedings of a Workshop Sponsored by the
Special Interest Group on the Lexicon of the Association for Computational
Linguistics and the National Science Foundation (June 21-22, 1999,
University of Maryland, College Park, Maryland, USA), pp. 74-80. Maryland:
University of Maryland. |
|
Baptista, J.; Faísca, J. 2001. Um filtro para
palavras exóticas frequentes do Português. Seminários de Linguística 4:
65-86. Faro: UALG-FCHS/CELL. |
|
Baptista, J.; Faísca, J. 2003. Mapping,
filtering and measuring impact of ambiguous words of Portuguese, 6th Intex
Workshop, Sofia, Bulgaria (May 28-30, 2003). |
|
Silberztein, M. 2004. Intex Manual.
http://intex.univ-fcomte.fr/downloads/Manual.pdf |
|
Mota, C. 2003. A Renewed Portuguese Module for Intex 4.3x. 6th
Intex Workshop, Sofia, Bulgaria (May 28-30, 2003). |
|
Mota, C. 2000. Analysis of Derivational Morphology by Finite State
Transducers, in Dister, A. (ed.), Actes des Troisièmes Journées INTEX, Revue
Informatique et Statistique dans les Sciences Humaines, 36, pp. 273-287,
Université de Liège. |
|
Ranchhod, E. 2001. O uso de dicionários e de
autómatos finitos na representação lexical das línguas naturais, in
Ranchhod, E. (Org.) Tratamento das línguas por computador. Uma introdução à
Linguística Computacional e suas aplicações, pp. 13-47. Lisboa: Caminho. |
|