[Snowball-discuss] Re: Spanish word stemmer

Martin Porter martin.porter at grapeshot.co.uk
Tue Jun 14 17:07:41 BST 2005


Felipe,

Thank you for the suffix list. It has proved interesting to work through it. The
general answer to your question is that the non-inclusion of these suffixes
(apart from ante/antes -- see below) is intentional, and also is best for the
algorithm.

You must remember that if a word ends with X, and X is a suffix in the language,
any of the following may apply:

a) X may significantly alter the meaning of a word. In this case it should not,
in an IR context, be removed.

b) X may not be a true suffix, but merely form the end of the stem.

c) X may be rare in the language, and therefore hardly worth removing.

d) X may be removable, but not worth removing because it leads to no further
conflations.
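
As a rough illustration of point (d), the following Python sketch checks whether
removing a candidate suffix merges any words that were not already merged. It is
only a toy: strip_suffix, new_conflations, the base suffix list and the
four-word vocabulary are all hypothetical and are not part of Snowball. The
result depends entirely on the vocabulary supplied.

```python
from collections import defaultdict


def strip_suffix(word, suffixes):
    """Toy rule (hypothetical, not Snowball): remove the longest listed suffix."""
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s):
            return word[:-len(s)]
    return word


def conflation_classes(words, suffixes):
    """Group words by the stem the toy rule assigns to them."""
    classes = defaultdict(set)
    for w in words:
        classes[strip_suffix(w, suffixes)].add(w)
    return dict(classes)


def new_conflations(words, base_suffixes, candidate):
    """Return the classes that only appear once `candidate` is also removed."""
    before = conflation_classes(words, base_suffixes)
    after = conflation_classes(words, base_suffixes + [candidate])
    return {
        stem: members
        for stem, members in after.items()
        if len(members) > 1 and members not in before.values()
    }


# Toy vocabulary and toy base suffixes, chosen only for the demonstration.
words = ["alarma", "alarmar", "alarmante", "camuflaje"]
base = ["a", "ar", "o", "os", "as"]

print(new_conflations(words, base, "ante"))  # 'alarmante' joins the 'alarm' class
print(new_conflations(words, base, "aje"))   # {} : nothing new is conflated here
```

On a real corpus the same comparison would be run over the full vocabulary, and
a suffix that produces few or no new classes falls under (d).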

Most of the suffixes you instance exhibit one or more of these features. Here is
your list:



SUFFIX
 EXAMPLES      STEMMER     RIGHT

orio/a/os/as                        <--- changes meaning too much
 absolutorio   absolutori  absolut
 accesorio     accesori    acces
 consultorio   consultori  consult

atorio/ia/ios/ias                   <--- changes meaning too much
 adoratorio    adoratori   ador
 eliminatorias eliminatori elimin
 acusatorias   acusatori   acus
 amatorias     amatori     amat
 aclaratorio   aclaratori  aclar

ante/es                             <================= done
 agonizante    agonizant   agoniz
 alarmante     alarmant    alarm
 abundante     abundant    abund
 caminante     caminant    camin
 emigrante     emigrant    emigr
 participante  participant particip


io/ia/ios/ias
 agravio       agravi       agrav
 alergia       alergi       alerg
 agraria       agrari       agrar
 academia      academi      academ

or/ora/ores/oras
 agresor       agresor      agres

ion                                 <--- removal rarely results in useful
                                    conflation
 agresión      agresion     agres
 admisión      admision     admis
 adopción      adopcion     adopc
 afección      afeccion     afecc

ito/ita/itos/itas                   <--- diminutive; often not an ending; rare
 ahorita       ahorit       ahor
 abuelita      abuelit      abuel

esa/esas                            <--- feminine; often not an ending; rare
 alcaldesa     alcaldes     alcald

ador/edor/idor                      <--- alters meaning too much:
                                    abrir open; abridor can-, bottle-opener
                                    conocer know; conocedor expert
 nadador       nadador      nad
 corredor      corredor     corr
 abridor       abridor      abr
 ganador       ganador      gan
 rompedor      rompedor     romp
 seguidor      seguidor     segu

ia                                  see ia above
 alemania      alemani      aleman
 italia        itali        ital
 francia       franci       franc

icio                                <--- rare
                                    alimentar feed; alimenticio nourishing
 alimenticio   alimentici  aliment

al                                  <--- rarer than English; many exceptions
                                    cardenal: cardinal (Math), cardeno purple
 ambiental     ambiental   ambient
 opcional      opcional    opcion
 monumental    monumental  monument
 doctoral      doctoral    doctor
 arbitral      arbitral    arbitr
 semanal       semanal     seman
 accidental    accidental  accident

ote/ota                             <--- augmentative
 amigote       amigot      amig
 grandote      grandot     grand
 palabrota     palabrot    palabr

ete/etes                            <--- ?
 abogadete     abogadet    abog

illo/a/os/as
 abogadillo    abogadill   abog

ato/atos
 anonimato     anonimat    anonim
 asesinato     asesinat    asesin
 alegato       alegat      aleg

aje/ajes
 arbitraje     arbitraj    arbitr
 aterrizaje    aterrizaj   aterriz
 camuflaje     camuflaj    camufl
 doblaje       doblaj      dobl

edad/edades                         <--- stem too short for this case
 brevedad      breved      brev
 enfermedad    enfermed    enferm
 gravedad      graved      grav
 salvedad      salved      salv

ísimo/ísimos                        <-- like ital. issimo
 buenísimo     buenisim    buen
 malísimo      malisim     mal
 rarísimo      rarisim     rar

ez/eces
 estupidez     estupidez   estupid
 sencillez     sencillez   sencill
 acidez        acidez      acid
 robustez      robustez    robust

izar
 actualizar    actualiz    actual
 mecanizar     mecaniz     mecan
 colonizar     coloniz     colon
 agilizar      agiliz      agil
 civilizar     civiliz     civil

So -orio on the whole changes meaning too much (acceso = access and accesorio =
accessory differ as much in Spanish as in English); -atorio similarly (aclarar =
to rinse, clear (in a very general sense), brighten up; aclaratorio =
explanatory).

Diminutives and augmentatives usually fall under (a) and (c). -illo, -ote and
-ísimo are in this category.

-al and -iz look like plausible candidates for ending removal, but, unlike their
English counterparts, removing them makes little difference or improvement.
Similarly with -ion removal after -s.

There is a difficulty with pure vowel endings, and the stemmer can't always get
this right. So in English 'academic' is stemmed to 'academ' but 'academy' does
not lose the final -y (or -i). This explains the residual vowels with -io, -ia
endings etc.

Your -edad endings are not removed when the stem is this short: the shorter the
stem the more chance there is of a suffix strongly altering word meaning (see
the original Porter stemmer discussion).
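
The effect of a stem-length condition can be sketched in Python as follows. The
R1/R2 region definitions follow the general Snowball description (R1 starts
after the first non-vowel that follows a vowel; R2 is the same rule applied
inside R1). The rule "remove -edad only if it lies in R2" is a hypothetical
example chosen for illustration, and is not claimed to be what the Spanish
stemmer actually does with -edad. The function names are likewise invented.

```python
VOWELS = set("aeiouáéíóúü")


def region_start(word, start=0):
    """Index just past the first non-vowel that follows a vowel, from `start`."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)  # null region at the end of the word


def r1_r2(word):
    """Return the start indices of R1 and R2."""
    r1 = region_start(word)
    r2 = region_start(word, r1)
    return r1, r2


def remove_if_in_r2(word, suffix):
    """Hypothetical rule: delete `suffix` only when it lies wholly inside R2."""
    _, r2 = r1_r2(word)
    if word.endswith(suffix) and len(word) - len(suffix) >= r2:
        return word[:-len(suffix)]
    return word


for w in ["brevedad", "gravedad", "salvedad", "contrariedad"]:
    print(w, r1_r2(w), remove_if_in_r2(w, "edad"))
# brevedad (4, 6) brevedad
# gravedad (4, 6) gravedad
# salvedad (3, 6) salvedad
# contrariedad (3, 7) contrari
```

With the short stems brev-, grav- and salv- the suffix starts before R2 and is
left alone; with the longer word (not from your list) it is removed.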

But you spotted ante/antes, which is useful and which I have added in (new
release soon). I can see historically how this came to be omitted, but I won't
bore you with the details.

In the case of attached pronouns, I only included the commoner forms. (For
example, '-noslo' appeared nowhere in our sample data.)

I did not understand your question about 'Among'. Is this in the generated Java
code?


I hope this answers your various questions.

Martin




