[Snowball-discuss] Swedish stemmer advocacy
Karl Wettin
karl.wettin at gmail.com
Sun Jan 3 12:44:02 GMT 2010
The Swedish Snowball stemmer does a terrible job according to
<http://web.jhu.edu/bin/q/b/p75-mcnamee.pdf>. It even claims that lfs5,
i.e. substring(0,5), does a better job.
(It also says that 5-grams crack the nut.)
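For readers who have not met these two baselines, here is a minimal Java sketch of what I take them to mean. The class and method names are made up for illustration, and the n-grams here are taken within a single word; the paper's exact tokenization may differ.

```java
import java.util.ArrayList;
import java.util.List;

class TruncationBaselines {
    // lfs5: keep at most the first five characters of the word.
    static String lfs5(String word) {
        return word.substring(0, Math.min(5, word.length()));
    }

    // Overlapping character n-grams; a word no longer than n is kept whole.
    static List<String> charNGrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        if (word.length() <= n) {
            grams.add(word);
            return grams;
        }
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }
}
```

With this sketch, lfs5 maps both "kursen" and "kurserna" to "kurse", while "kurs" stays "kurs".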
This didn't come as a surprise to me, as I've identified problems in the
past and implemented my own augmentation, which has been posted to this
list before and now lives at
<http://issues.apache.org/jira/browse/LUCENE-1515>.
Reading the paper made me take a closer look at what's wrong.
define main_suffix as (
    setlimit tomark p1 for ([substring])
    among(
        'a' 'arna' 'erna' 'heterna' 'orna' 'ad' 'e' 'ade' 'ande'
        'arne'
        'are' 'aste' 'en' 'anden' 'aren' 'heten' 'ern' 'ar' 'er'
        'heter'
        'or' 'as' 'arnas' 'ernas' 'ornas' 'es' 'ades' 'andes'
        'ens' 'arens'
        'hetens' 'erns' 'at' 'andet' 'het' 'ast'
        'era' 'erar' 'erarna' 'erarnas'
        // augmentation starts here
        'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
        'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar'
        'anserna' 'ansernas'
        'iera' 'ierat' 'ierats' 'ierad' 'ierade' 'ierades'
        'ikation'
        'ikat' 'ikatet' 'ikatets' 'ikaten' 'ikatens'
        // augmentation ends here
            (delete)
        's'
            (s_ending delete)
    )
)
In conjunction with ~200 exception rules, these additions help. There
are, however, quite a few problems with many of the old rules.
E.g. 's' (s_ending delete) is a plural rule, but it has ~5300
exceptions: words whose nominative singular already ends in s. The
problem appears when such a word is written in any form other than the
nominative singular.
kurs (course)
kursen (the course)
kursens (the [undefined noun] of the course)
kurser (courses)
kurserna (the courses)
kursernas (the [undefined noun] of the courses)
Kurs is stemmed to "kur" (which, by the way, will falsely match kur as
in remedy), while all the others are correctly stemmed to "kurs".
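The mismatch can be reproduced with a toy model of the main_suffix step. This is only a sketch under assumptions: it uses a hand-picked subset of the delete suffixes, assumes the s_ending letters are 'bcdfghjklmnoprtvy' as in the stemmer I have in front of me, computes p1 the way I understand mark_regions to do it, and omits the later steps of the stemmer. The class and method names are invented.

```java
import java.util.Arrays;
import java.util.List;

class ToySwedishSuffixStep {
    // Swedish vowels; the three non-ASCII ones are written as escapes.
    static final String VOWELS = "aeiouy\u00e4\u00e5\u00f6";
    // Assumption: letters that may precede a removable final 's'.
    static final String S_ENDING = "bcdfghjklmnoprtvy";
    // Subset of the plain-delete suffixes, just enough for the kurs forms.
    static final List<String> DELETE_SUFFIXES =
            Arrays.asList("ernas", "erna", "ens", "en", "er");

    static boolean isVowel(char c) {
        return VOWELS.indexOf(c) >= 0;
    }

    // p1: just after the first non-vowel that follows a vowel, but never before 3.
    static int p1(String w) {
        if (w.length() < 3) return w.length();
        int i = 0;
        while (i < w.length() && !isVowel(w.charAt(i))) i++; // move to the first vowel
        while (i < w.length() && isVowel(w.charAt(i))) i++;  // move to the next non-vowel
        if (i >= w.length()) return w.length();              // no such position: R1 is empty
        return Math.max(i + 1, 3);
    }

    static String mainSuffix(String w) {
        int limit = p1(w);
        String best = "";
        // Longest suffix from the subset that lies entirely after p1.
        for (String s : DELETE_SUFFIXES) {
            boolean insideR1 = w.length() - s.length() >= limit;
            if (w.endsWith(s) && insideR1 && s.length() > best.length()) best = s;
        }
        if (!best.isEmpty()) return w.substring(0, w.length() - best.length());
        // The bare 's' rule: delete only if the preceding letter is in S_ENDING.
        boolean endsInS = w.endsWith("s") && w.length() - 1 >= limit;
        if (endsInS && S_ENDING.indexOf(w.charAt(w.length() - 2)) >= 0) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }
}
```

In this toy model mainSuffix("kurs") gives "kur", while the five inflected forms listed above all give "kurs", because each of them matches a longer suffix before the bare 's' rule is ever considered.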
Altogether there are, by my estimate, some 10 000 words that produce
incompatible stems between the nominative singular and any other form.
That is about 8% of the official language.
One rather simple solution is to always use both unstemmed and stemmed
words, e.g. as synonyms in an inverted index. But if only the stemmed
output (from the official stemmer or my augmentation) is used, I'd
argue it's better to skip stemming altogether.
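As a rough sketch of the synonym idea, the following emits both the surface form and its stem for one token position. The class name is invented, and the stemmer is passed in as a plain function so that nothing here depends on a particular stemmer implementation.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

class DualFormTokens {
    // Terms to index at one token position: the surface form first,
    // then the stem if it differs from the surface form.
    static Set<String> termsFor(String word, UnaryOperator<String> stemmer) {
        Set<String> terms = new LinkedHashSet<>();
        terms.add(word);
        terms.add(stemmer.apply(word));
        return terms;
    }
}
```

With the current stemmer, "kurs" would be indexed as {kurs, kur} and "kursen" as {kursen, kurs}, so the two forms still share the term "kurs".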
A better solution would be to set up the stemmer to ignore the 10 000
exceptions. What would be the best way to implement this? I'd like the
generated Java code to simply contain a HashSet<String>
noStemExceptions; that is checked first, or something like that.
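To make the question concrete, this is roughly the behaviour I have in mind, written as a wrapper around an arbitrary stemming function rather than as generated code. The class name and constructor are hypothetical, not something the Snowball compiler produces today.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

class ExceptionAwareStemmer {
    private final Set<String> noStemExceptions;
    private final UnaryOperator<String> delegate;

    ExceptionAwareStemmer(Set<String> noStemExceptions, UnaryOperator<String> delegate) {
        // Defensive copy so later changes to the caller's set do not leak in.
        this.noStemExceptions = new HashSet<>(noStemExceptions);
        this.delegate = delegate;
    }

    // Words in the exception set are returned untouched;
    // everything else goes through the wrapped stemmer.
    String stem(String word) {
        if (noStemExceptions.contains(word)) {
            return word;
        }
        return delegate.apply(word);
    }
}
```

With "kurs" in the set, stem("kurs") returns "kurs" unchanged, while the inflected forms still go through the wrapped stemmer and come out as "kurs" as well.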
karl