[Snowball-discuss] Re: Spanish word stemmer
Martin Porter
martin.porter at grapeshot.co.uk
Tue Jun 14 17:07:41 BST 2005
Felipe,
Thank you for the suffix list. It has proved interesting to work through it. The
general answer to your question is that the non-inclusion of these suffixes
(apart from ante/antes -- see below) is intentional, and also is best for the
algorithm.
You must remember that if a word ends with X, and X is a suffix in the language,
a) X may significantly alter the meaning of a word. In this case it should not,
in an IR context, be removed.
b) X may not be a true suffix, but merely form the end of the stem.
c) X may be rare in the language, and therefore hardly worth removing.
d) X may be removable, but not worth removing because it leads to no further
conflations.
Most of the suffixes you instance exhibit one or more of these features. Here is
your list:
SUFFIX  EXAMPLES        STEMMER         RIGHT
orio/a/os/as    <--- changes meaning too much
        absolutorio     absolutori      absolut
        accesorio       accesori        acces
        consultorio     consultori      consult
atorio/ia/ios/ias    <--- changes meaning too much
        adoratorio      adoratori       ador
        eliminatorias   eliminatori     elimin
        acusatorias     acusatori       acus
        amatorias       amatori         amat
        aclaratorio     aclaratori      aclar
ante/es    <================= done
        agonizante      agonizant       agoniz
        alarmante       alarmant        alarm
        abundante       abundant        abund
        caminante       caminant        camin
        emigrante       emigrant        emigr
        participante    participant     particip
io/ia/ios/ias
        agravio         agravi          agrav
        alergia         alergi          alerg
        agraria         agrari          agrar
        academia        academi         academ
or/ora/ores/oras
        agresor         agresor         agres
ion    <--- removal rarely results in useful conflation
        agresión        agresion        agres
        admisión        admision        admis
        adopción        adopcion        adopc
        afección        afeccion        afecc
ito/ita/itos/itas    <--- diminutive; often not an ending; rare
        ahorita         ahorit          ahor
        abuelita        abuelit         abuel
esa/esas    <--- feminine; often not an ending; rare
        alcaldesa       alcaldes        alcald
ador/edor/idor    <--- alters meaning too much:
                       abrir open; abridor can-, bottle-opener
                       conocer know; conocedor expert
        nadador         nadador         nad
        corredor        corredor        corr
        abridor         abridor         abr
        ganador         ganador         gan
        rompedor        rompedor        romp
        seguidor        seguidor        segu
ia    see ia above
        alemania        alemani         aleman
        italia          itali           ital
        francia         franci          franc
icio    <--- rare: alimentar feed; alimenticio nourishing
        alimenticio     alimentici      aliment
al    <--- rarer than English; many exceptions
           cardenal: cardinal (Math), cardeno purple
        ambiental       ambiental       ambient
        opcional        opcional        opcion
        monumental      monumental      monument
        doctoral        doctoral        doctor
        arbitral        arbitral        arbitr
        semanal         semanal         seman
        accidental      accidental      accident
ote/ota    <--- augmentative
        amigote         amigot          amig
        grandote        grandot         grand
        palabrota       palabrot        palabr
ete/etes    <--- ?
        abogadete       abogadet        abog
illo/a/os/as
        abogadillo      abogadill       abog
ato/atos
        anonimato       anonimat        anonim
        asesinato       asesinat        asesin
        alegato         alegat          aleg
aje/ajes
        arbitraje       arbitraj        arbitr
        aterrizaje      aterrizaj       aterriz
        camuflaje       camuflaj        camufl
        doblaje         doblaj          dobl
edad/edades    <--- stem too short for this case
        brevedad        breved          brev
        enfermedad      enfermed        enferm
        gravedad        graved          grav
        salvedad        salved          salv
ísimo/ísimos    <-- like ital. issimo
        buenísimo       buenisim        buen
        malísimo        malisim         mal
        rarísimo        rarisim         rar
ez/eces
        estupidez       estupidez       estupid
        sencillez       sencillez       sencill
        acidez          acidez          acid
        robustez        robustez        robust
izar
        actualizar      actualiz        actual
        mecanizar       mecaniz         mecan
        colonizar       coloniz         colon
        agilizar        agiliz          agil
        civilizar       civiliz         civil
So -orio on the whole changes meaning too much (acceso = access and accesorio =
accessory differ as much in Spanish as in English); -atorio similarly (aclarar =
to rinse, to clear (in a very general sense), to brighten up; aclaratorio =
explanatory).
Diminutives and augmentatives usually fall under (a) and (c). -illo, -ote and
-ísimo are in this category.
-al and -iz look like plausible candidates for ending removal, but, unlike their
English counterparts, removing them makes little difference or improvement.
Similarly with -ion removal after -s.
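
Point (d), and the remark about -al, is the kind of claim that can be checked
against a word list. The following is only a minimal sketch of such a check, not
part of the Snowball stemmer: toy_stem is a hypothetical stand-in for the
existing stemmer, new_conflations is an illustrative helper name, and the words
"ambiente" and "semana" are added to the vocabulary purely as an assumption for
the example.

```python
def toy_stem(word):
    # Hypothetical baseline: drop one trailing a/e/o. NOT the Snowball algorithm.
    return word[:-1] if len(word) > 3 and word[-1] in "aeo" else word


def new_conflations(words, stem_fn, suffix):
    """Map each word that gains partners when `suffix` is also stripped
    (after the baseline stemmer has run) to the partners it gains."""
    base = {w: stem_fn(w) for w in words}
    ext = {}
    for w, s in base.items():
        strip = s.endswith(suffix) and len(s) > len(suffix)
        ext[w] = s[: -len(suffix)] if strip else s
    gained = {}
    for w in words:
        if ext[w] == base[w]:
            continue  # the candidate rule did not touch this word
        before = {v for v in words if base[v] == base[w]}
        after = {v for v in words if ext[v] == ext[w]}
        if after - before:
            gained[w] = sorted(after - before)
    return gained


# Assumed vocabulary: the -al words come from the list above,
# "ambiente" and "semana" are added for illustration only.
vocabulary = ["ambiental", "ambiente", "semanal", "semana", "cardenal"]
print(new_conflations(vocabulary, toy_stem, "al"))
```

On this toy list "ambiental" and "semanal" each gain a partner and "cardenal"
gains none. Whether a suffix is worth removing depends on how such counts come
out on a real vocabulary, which this small list does not represent.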
There is a difficulty with pure vowel endings, and the stemmer can't always get
this right. So in English 'academic' is stemmed to 'academ' but 'academy' does
not lose the final -y (or -i). This explains the residual vowels with -io, -ia
endings etc.
Your -edad endings are not removed when the stem is this short: the shorter the
stem the more chance there is of a suffix strongly altering word meaning (see
the original Porter stemmer discussion).
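
The idea of guarding a suffix removal by stem length can be sketched with the
"measure" m from the original Porter stemmer, where a rule such as (m > 1) EMENT
strips "replacement" to "replac" but leaves "cement" alone. The code below is an
illustration only: it does not reproduce the Spanish stemmer's actual
conditions, it treats 'y' as a consonant for simplicity, and the function names
and the threshold applied to -edad are assumptions made for the example.

```python
VOWELS = set("aeiouáéíóúü")  # simplification: 'y' is always a consonant here


def measure(stem):
    """Porter's m: the number of vowel-run to consonant-run transitions."""
    m = 0
    prev_is_vowel = False
    for ch in stem.lower():
        is_vowel = ch in VOWELS
        if prev_is_vowel and not is_vowel:
            m += 1
        prev_is_vowel = is_vowel
    return m


def strip_if_long_enough(word, suffix, min_measure=1):
    """Remove `suffix` only when the remaining stem has measure > min_measure."""
    if not word.endswith(suffix):
        return word
    stem = word[: -len(suffix)]
    return stem if measure(stem) > min_measure else word


# Short stems (brev, grav, salv all have m = 1) are left untouched.
for w in ["brevedad", "gravedad", "salvedad"]:
    print(w, "->", strip_if_long_enough(w, "edad"))

# Porter's English example for the same kind of condition.
print(strip_if_long_enough("replacement", "ement"))
print(strip_if_long_enough("cement", "ement"))
```

With the assumed threshold, the three short -edad words are returned unchanged,
which mirrors the reasoning above: a stem as short as "brev" or "grav" carries
too little information to strip the suffix safely.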
But you spotted ante/antes, which is useful and which I have added in (new
release soon). I can see historically how this came to be omitted, but I won't
bore you with the details.
In the case of attached pronouns, I only included the commoner forms. (For
example, '-noslo' appeared nowhere in our sample data.)
Your question about 'Among' I did not understand. Is this in the Java-generated
code?
I hope this answers your various questions.
Martin