[Snowball-discuss] Swedish stemmer advocacy

Sun Jan 3 12:44:02 GMT 2010

The Swedish Snowball stemmer does a terrible job according to
<http://web.jhu.edu/bin/q/b/p75-mcnamee.pdf>. The paper even claims
that lfs5, i.e. substring(0,5), does a better job.
(It also says that 5-grams crack the nut.)
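
For readers unfamiliar with those two baselines, here is a rough Java
sketch. It assumes that "lfs5" simply keeps the first five characters of
a word and that "5-grams" means overlapping character 5-grams within a
single word; the paper may define either differently. The class and
method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TruncationBaselines {
    /** Keep at most the first n characters (the substring(0,5) idea for n = 5). */
    static String truncate(String word, int n) {
        return word.length() <= n ? word : word.substring(0, n);
    }

    /**
     * Overlapping character n-grams inside one word.
     * Assumption: a word no longer than n is returned whole as its only gram.
     */
    static List<String> charNGrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        if (word.length() <= n) {
            grams.add(word);
            return grams;
        }
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(truncate("kurserna", 5));   // kurse
        System.out.println(charNGrams("kursens", 5));  // [kurse, ursen, rsens]
    }
}
```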

This didn't come as a surprise to me, as I've identified problems in
the past and implemented my own augmentation, which has been posted to
this list before and now lives at
<http://issues.apache.org/jira/browse/LUCENE-1515>.

Reading the paper made me take a closer look at what's wrong.

     define main_suffix as (
         setlimit tomark p1 for ([substring])
         among(
             'a' 'arna' 'erna' 'heterna' 'orna' 'ad' 'e' 'ade' 'ande' 'arne'
             'are' 'aste' 'en' 'anden' 'aren' 'heten' 'ern' 'ar' 'er' 'heter'
             'or' 'as' 'arnas' 'ernas' 'ornas' 'es' 'ades' 'andes' 'ens' 'arens'
             'hetens' 'erns' 'at' 'andet' 'het' 'ast'
             'era' 'erar' 'erarna' 'erarnas'
             // augmentation starts here
             'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
             'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' 'ansernas'
             'iera' 'ierat' 'ierats' 'ierad' 'ierade' 'ierades'
             'ikation'
             'ikat' 'ikatet' 'ikatets' 'ikaten' 'ikatens'
             // augmentation ends here
                 (delete)

             's'
                 (s_ending delete)

In conjunction with ~200 exception rules these additions help. There
are, however, quite a few problems with many of the old rules.

E.g. 's' (s_ending delete) is a plural rule, but it has ~5300
exceptions: words that end with s in the nominative singular. The
problem appears when such a word is written in any form other than the
nominative singular, because the stems then no longer match.

kurs (course)
kursen (the course)
kursens (the [undefined noun] of the course)
kurser (courses)
kurserna (the courses)
kursernas (the [undefined noun] of the courses)

Kurs is stemmed to "kur" (which, by the way, will falsely match kur as
in remedy), while all the others are correctly stemmed as "kurs".

Altogether there are, by my estimate, some 10 000 words that produce
incompatible stems between the nominative singular and any other form.
That is about 8% of the official language.

One rather simple solution is to always use both the unstemmed and the
stemmed words, e.g. as synonyms in an inverted index. But if only the
stemmed output is used (from the official stemmer or from my
augmentation), I'd argue it's better to skip stemming altogether.
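
A minimal sketch of the "both forms as synonyms" idea, assuming a toy
in-memory inverted index. DualTermIndex and stubStem are hypothetical
names; stubStem is only a stand-in that hard-codes the stems listed
above for the kurs forms, not the real Snowball stemmer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

final class DualTermIndex {
    private final UnaryOperator<String> stemmer;
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    DualTermIndex(UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
    }

    /** The surface form plus its stem (a single term if they are identical). */
    Set<String> terms(String token) {
        Set<String> out = new LinkedHashSet<>();
        out.add(token);
        out.add(stemmer.apply(token));
        return out;
    }

    /** Index every token of a document under both of its terms. */
    void add(int docId, String... tokens) {
        for (String token : tokens) {
            for (String term : terms(token)) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Expand the query token the same way and union the postings. */
    Set<Integer> search(String queryToken) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : terms(queryToken)) {
            hits.addAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    /** Stand-in stemmer: hard-coded stems for the kurs forms, identity otherwise. */
    static String stubStem(String word) {
        switch (word) {
            case "kurs":
                return "kur";
            case "kursen": case "kursens": case "kurser":
            case "kurserna": case "kursernas":
                return "kurs";
            default:
                return word;
        }
    }

    public static void main(String[] args) {
        DualTermIndex index = new DualTermIndex(DualTermIndex::stubStem);
        index.add(1, "kurs");
        index.add(2, "kurserna");
        System.out.println(index.search("kurs"));
    }
}
```

With a stem-only index, document 1 would be filed under "kur" and
document 2 under "kurs", so a query for kurs would miss document 2.
Indexing both forms lets the two meet on the term "kurs". Note that
this sketch does nothing about the false match with kur (remedy), since
"kur" is still emitted as a term for kurs.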

A better solution would be to set up the stemmer to ignore the 10 000
exceptions. What would be the best way to implement this? I'd like the
generated Java code to simply contain a HashSet<String>
noStemExceptions that is checked first, or something like that.
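
Until the generated code can do this itself, the check could be
approximated by wrapping the stemmer from the outside. The sketch below
is one way to do that, not a change to the Snowball code generator.
ExceptionAwareStemmer and stubStem are hypothetical names, and stubStem
is a stand-in that hard-codes the stems from the kurs example above. A
real delegate would presumably call setCurrent, stem and getCurrent on
the generated stemmer class.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

final class ExceptionAwareStemmer implements UnaryOperator<String> {
    private final Set<String> noStemExceptions;
    private final UnaryOperator<String> delegate;

    ExceptionAwareStemmer(Set<String> noStemExceptions, UnaryOperator<String> delegate) {
        // Defensive copy so later changes to the caller's set have no effect here.
        this.noStemExceptions = new HashSet<>(noStemExceptions);
        this.delegate = delegate;
    }

    @Override
    public String apply(String word) {
        // Checked first: a listed word is returned untouched.
        if (noStemExceptions.contains(word)) {
            return word;
        }
        return delegate.apply(word);
    }

    /** Stand-in stemmer: hard-coded stems for the kurs forms, identity otherwise. */
    static String stubStem(String word) {
        switch (word) {
            case "kurs":
                return "kur";
            case "kursen": case "kursens": case "kurser":
            case "kurserna": case "kursernas":
                return "kurs";
            default:
                return word;
        }
    }

    public static void main(String[] args) {
        Set<String> exceptions = new HashSet<>(Arrays.asList("kurs"));
        ExceptionAwareStemmer stemmer =
                new ExceptionAwareStemmer(exceptions, ExceptionAwareStemmer::stubStem);
        for (String word : new String[] {"kurs", "kursen", "kurserna"}) {
            System.out.println(word + " -> " + stemmer.apply(word));
        }
    }
}
```

Only the nominative singular forms need to be in the set: kurs is then
left alone while kursen, kurserna and the rest still stem to "kurs".
The lookup is an exact string match, so this assumes words are already
lowercased before they reach the stemmer.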

         karl