[Snowball-discuss] Swedish stems need patching

Karl Wettin karl.wettin at gmail.com
Wed Jan 7 09:44:47 GMT 2009


I've been looking into this. Stemming the -an and -ans suffixes (by
removing the n or ns) is rather damaging. Altogether there are 1500
Swedish words in singular form that end in an or ans, and some 50 of
them are problematic when this suffix is removed. It would of course
be possible to keep track of all these exceptions that should not be
stemmed.
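
To make the rule and the exception idea concrete, here is a minimal
sketch in Python. It is not the Snowball implementation: the function
name strip_an_ans and the EXCEPTIONS set are made up for illustration,
and the set holds only a handful of the problematic words listed in
this message.

```python
# Hypothetical sketch: strip the trailing "n" or "ns" from words ending in
# -an / -ans, unless the word is on an exception list.
# EXCEPTIONS is illustrative only, not a complete list.
EXCEPTIONS = {
    "afrikan", "amerikan", "banan", "dans", "finans", "krans",
    "kran", "koran", "roman", "romans", "svan", "svans",
}


def strip_an_ans(word, exceptions=EXCEPTIONS):
    """Return the word with a final n/ns removed when it ends in -an/-ans."""
    if word in exceptions:
        return word          # protected word, leave untouched
    if word.endswith("ans"):
        return word[:-2]     # drop "ns", keep the "a"
    if word.endswith("an"):
        return word[:-1]     # drop "n", keep the "a"
    return word
```

With this sketch "flickan" and "flickans" both become "flicka", while
"banan" is left alone. Compounds such as brygdans would need a suffix
match against the exception list rather than the exact lookup used
here.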

A few of the problematic words: afrikan, amerikan, banan, glan, glans,
gan, gans, dans (brygdans, gammeldans, etc.), dan, fina, finans,
krans, kran, koran, roman, romans, samman, seans (actually
problematic with the current stemmer too), svan, svans, tran, trans,
usans, vacklan, vakans

There are probably more words that I have not seen, as I only stemmed
using the rule as described above; i.e. there might be lots of words
that, when stemmed using the other rules in the Swedish stemmer, turn
out not to be that unique and become problematic in combination with
the an/ans rule. I'll try to come up with something later on; for now
I'm "happy" to know that the an/ans suffix stemming cannot be applied
without further work.
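
For checking a wider vocabulary, the three-column comparison Martin
describes in the quoted message below could be sketched roughly like
this in Python. The two stemmer functions are stand-ins I made up for
illustration: current_stemmer does nothing at all, and
candidate_stemmer applies only the naive an/ans rule. In practice both
would be the full Swedish stemmer, without and with the extra rule.

```python
# Hypothetical sketch of the word / stem1 / stem2 / * comparison.
def current_stemmer(word):
    """Stand-in for the stemmer without the extra rule (identity here)."""
    return word


def candidate_stemmer(word):
    """Stand-in for the stemmer with the naive an/ans rule added."""
    if word.endswith("ans"):
        return word[:-2]
    if word.endswith("an"):
        return word[:-1]
    return word


def compare_stemmers(words, stem1, stem2):
    """Return (word, stem1, stem2, mark) rows; mark is '*' when stems differ."""
    rows = []
    for word in words:
        a, b = stem1(word), stem2(word)
        rows.append((word, a, b, "*" if a != b else ""))
    return rows


def format_table(rows):
    """Lay the rows out in fixed-width columns for reading by hand."""
    layout = "{:<14}{:<14}{:<14}{}"
    lines = [layout.format("word", "stem1", "stem2", "*")]
    for row in rows:
        lines.append(layout.format(*row).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    sample = ["banan", "flickan", "hus", "svans"]
    print(format_table(compare_stemmers(sample, current_stemmer, candidate_stemmer)))
```

The rows marked with * are the ones to go through by hand and judge as
improvements or degradations.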


        karl


On 29 Jan 2008, at 17:54, Martin Porter wrote:

>
>
> Janko,
>
> Either -an, -ans was overlooked when the stemmer was written, or the
> effect of including -an, -ans was found to be generally damaging because
> of the additional mis-stemming it produced (Karl Wettin's example
> list...). At this distance from the original work I cannot say which is
> true.
>
> Presumably you have included -an, -ans in your version? If so, could you
> report back later on whether you encounter mis-stemming problems? We can
> continue from there.
>
> (Something you have to do in introducing a new rule is to compare the
> sample vocabulary stemmed with the extra rule and stemmed without the
> extra rule. You draw up 3 columns,
>
>  word       stem1     stem2   *
>  ...        ...       ...
>
> and put '*' at the end when stem1 and stem2 differ. Then you go through
> by hand looking for the * indicator and monitoring the improvements and
> degradations. As a native speaker of Swedish, you would of course be
> much better placed than me to judge the result.)
>
> I hope this helps,
>
> Martin
>
>
>
>
> On Thu, 2008-01-17 at 18:34 +0100, Janko Luin wrote:
>> I have recently implemented an acts_as_ferret based search engine on a
>> Swedish site, and ran into the Swedish stemmer head-on. It's mostly
>> very good, but misses two common noun forms: '-an' and '-ans'.
>>
>
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss at lists.tartarus.org
> http://lists.tartarus.org/mailman/listinfo/snowball-discuss



