[Snowball-discuss] Swedish stemmer advocacy

Mon Jan 25 11:59:16 GMT 2010

Karl,

Thanks for your email of 3rd January, which has certainly set me thinking. I
have now read the paper of McNamee, Nicholas and Mayfield, and overall found
their results encouraging for snowball, despite your observations on its
performance with the currently offered Swedish stemmer. You have to realise
that I've been informed from time to time of very negative behaviour for the
stemmers, for example that the French stemmer performs worse than no
stemming at all. So to see snowball giving significant performance
improvements for all its languages (with merely 'improvement' for Dutch) is
good news. That applies to Swedish as well.

My understanding is that their 4-gram and 5-gram methods run over word
boundaries, so they are capturing phrase searching by the adjacency of query
terms. The 4-gram and 5-gram methods also help with respelling, and the use
of compounds (Pfafferweiterbildunskomission etc.). In a sense the success of
the n-gram approach depends upon an underlying IR model in which one will not
attempt to discover phrases in the text of queries, or attempt respelling
prior to stemming, and so on. Their 'snow' and 'n-gram' runs are not quite a
parallel comparison.
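
To make the word-boundary point concrete, here is a minimal sketch of
character n-grams taken over the whole query string, spaces included. It
assumes that this is how the paper's n-grams are formed, which is only my
reading; the class and method names (NGramSketch, charNGrams) are
hypothetical and do not come from the paper or from snowball.

```java
import java.util.ArrayList;
import java.util.List;

class NGramSketch {
    // Returns every run of n consecutive characters in the text.
    // Whitespace is collapsed to single spaces but NOT removed, so some
    // n-grams straddle the boundary between two adjacent words.
    static List<String> charNGrams(String text, int n) {
        String s = text.trim().toLowerCase().replaceAll("\\s+", " ");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints [new , ew y, w yo,  yor, york]
        System.out.println(charNGrams("new york", 4));
    }
}
```

The grams "ew y", "w yo" and " yor" can only occur where the two words stand
next to each other, so matching on them rewards adjacency in much the way a
phrase search would.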

But to get to Swedish. I'd like to revisit this, but need to understand
exactly your point about the -s ending, where my estimate of its incidence
does not match your figure. I've been through the L section in the Routledge
Swedish Dictionary (London, 1993). Looking at headwords only, there are 34
words ending -s (lackmus, lakrits, laktos, lans ...), and only 24 which are
noun forms (for example, laktos is adjectival). This leads to an estimate of
about 550 words ending -s in the whole dictionary. A useful subclass is
words ending -lo"s (l, o-umlaut, s) from which the s, I take it, should not
be removed. I have not looked at the sample Swedish vocabulary, but my
impression is that -s endings are not a major factor. Have I misunderstood
you here?

Martin

At 01:44 PM 1/3/2010 +0100, Karl Wettin wrote:
>The Swedish Snowball stemmer does a terrible job according to
><http://web.jhu.edu/bin/q/b/p75-mcnamee.pdf>. It even claims that lfs5,
>i.e. substring(0,5), does a better job. (It also says that 5-grams crack
>the nut.)
>
>This didn't come as a surprise to me, as I've identified problems in the
>past and implemented my own augmentation, which has been posted to this
>list before and now lives at
><http://issues.apache.org/jira/browse/LUCENE-1515>.
>
>Reading the paper made me take a closer look at what's wrong.
>
>     define main_suffix as (
>         setlimit tomark p1 for ([substring])
>         among(
>             'a' 'arna' 'erna' 'heterna' 'orna' 'ad' 'e' 'ade' 'ande' 'arne'
>             'are' 'aste' 'en' 'anden' 'aren' 'heten' 'ern' 'ar' 'er' 'heter'
>             'or' 'as' 'arnas' 'ernas' 'ornas' 'es' 'ades' 'andes' 'ens' 'arens'
>             'hetens' 'erns' 'at' 'andet' 'het' 'ast'
>             'era' 'erar' 'erarna' 'erarnas'
>             // augmentation starts here
>             'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
>             'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' 'ansernas'
>             'iera' 'ierat' 'ierats' 'ierad' 'ierade' 'ierades'
>             'ikation'
>             'ikat' 'ikatet' 'ikatets' 'ikaten' 'ikatens'
>             // augmentation ends here
>                 (delete)
>
>             's'
>                 (s_ending delete)
>         )
>     )
>
>
>In conjunction with ~200 exception rules these additions help. There
>are, however, quite a few problems with many of the old rules.
>
>
>E.g. 's' (s_ending delete) is a plural rule, but it has ~5300
>exceptions: words whose nominative singular already ends in s. The
>problem shows up when such a word is written in any form other than the
>nominative singular.
>
>kurs (course)
>kursen (the course)
>kursens (the [undefined noun] of the course)
>kurser (courses)
>kurserna (the courses)
>kursernas (the [undefined noun] of the courses)
>
>Kurs is stemmed to "kur" (which by the way will mismatch with kur as
>in remedy) while all the others are correctly stemmed as "kurs".
>
>Altogether there are, according to my estimation, some 10 000 words
>that will create incompatible stems between the nominative singular
>and any other form. That is about 8% of the official language.
>
>One rather simple solution is to always use both unstemmed and stemmed  
>words, e.g. as synonyms in an inverted index. But if only using the  
>stemmed output (from the official stemmer or my augmentation) I'd  
>argue it's better to skip stemming altogether.
>
>A better solution would be to set up the stemmer to ignore the 10 000  
>exceptions. What would be the best way to implement this? I'd like the  
>generated Java code to simply contain a HashSet<String>
>noStemExceptions that is checked first, or something like that.
>
>
>         karl
>
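
The exception-set idea in the quoted message can be sketched in Java roughly
as follows. This is only an illustration under stated assumptions: toyStem is
a hypothetical stand-in that knows a handful of the endings listed above and
is NOT the Snowball Swedish algorithm (it ignores the s_ending letter check
and the R1 region), and the one-word exception set stands in for the real
list of about 10 000 words. The wrapper stem() is the part that matters.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class NoStemSketch {
    // Hypothetical exception list; a real one would hold the nominative
    // singular forms that must be left untouched.
    static final Set<String> NO_STEM_EXCEPTIONS =
            new HashSet<>(Arrays.asList("kurs"));

    // A few endings from main_suffix, longest first, plus the bare 's'.
    static final String[] SUFFIXES = {"ernas", "erna", "ens", "en", "er", "s"};

    // Toy stand-in for the generated stemmer (assumption, not Snowball):
    // strip the first matching ending if at least three letters remain.
    static String toyStem(String word) {
        for (String suffix : SUFFIXES) {
            int stemLength = word.length() - suffix.length();
            if (word.endsWith(suffix) && stemLength >= 3) {
                return word.substring(0, stemLength);
            }
        }
        return word;
    }

    // The exception set is consulted first; only other words are stemmed.
    static String stem(String word) {
        return NO_STEM_EXCEPTIONS.contains(word) ? word : toyStem(word);
    }

    public static void main(String[] args) {
        String[] forms = {"kurs", "kursen", "kursens",
                          "kurser", "kurserna", "kursernas", "kur"};
        for (String w : forms) {
            System.out.println(w + ": unguarded=" + toyStem(w)
                    + " guarded=" + stem(w));
        }
    }
}
```

Without the guard, "kurs" becomes "kur" while its inflected forms become
"kurs". With the guard, all six forms share the stem "kurs" and "kur"
(remedy) stays separate. A lookup in a hash set costs little per word, but
the list has to be compiled and maintained outside the stemmer.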