[Snowball-discuss] Swedish stems need patching
Karl Wettin
karl.wettin at gmail.com
Wed Jan 7 09:44:47 GMT 2009
I've been looking into this. Stemming the -an and -ans suffixes
(removing n or ns) is rather damaging. Altogether there are 1500
Swedish words in singular form that are suffixed with an or ans; some
50 of them are problematic when this suffix is removed. It would of
course be possible to keep track of all these exceptions that should
not be stemmed.
A few of the problematic words: afrikan, amerikan, banan, glan, glans,
gan, gans, dans (brygdans, gammeldans, etc.), dan, fina, finans,
krans, kran, koran, roman, romans, samman, seans (actually
problematic with the current stemmer too), svan, svans, tran, trans,
usans, vacklan, vakans
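For illustration, here is a minimal Python sketch of the rule plus an exception list, as I understand the idea. It is not the Snowball Swedish stemmer: the function name strip_an_ans and the EXCEPTIONS set are hypothetical, the set holds only a handful of the words listed above, and flickan/flickans is my own example of a noun where the removal is harmless.

```python
# Hypothetical sketch of the an/ans rule with an exception list.
# Not the Snowball Swedish stemmer; names and data are illustrative only.

# A few of the problematic words from the list above (far from complete).
EXCEPTIONS = {
    "banan", "dans", "finans", "kran", "krans", "koran",
    "roman", "romans", "svan", "svans",
}


def strip_an_ans(word, exceptions=EXCEPTIONS):
    """Drop the 'n' of -an or the 'ns' of -ans, unless the word is listed."""
    if word in exceptions:
        return word
    if word.endswith("ans"):
        return word[:-2]  # remove "ns"
    if word.endswith("an"):
        return word[:-1]  # remove "n"
    return word


if __name__ == "__main__":
    for w in ["flickan", "flickans", "banan", "svans"]:
        # Second column: rule with exceptions; third: rule applied blindly.
        print(w, strip_an_ans(w), strip_an_ans(w, exceptions=set()))
```

Applied blindly, the rule also merges roman and romans into the same stem, which is the kind of damage described above.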
There are probably more words that I have not seen, as I only stemmed
using the rule as described above; i.e., there might be lots of words
that, when stemmed using other rules in the Swedish stemmer, will turn
out not to be that unique and to be problematic combined with the
an/ans rule.
I'll try to come up with something later on; for now I'm "happy" to
know that the an/ans suffix stemming cannot be applied without
further work.
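The side-by-side comparison Martin describes in the quoted message below (word, stem1, stem2, and a '*' where the two stems differ) could be sketched roughly as follows. Everything here is an assumption made for illustration: baseline is only a stand-in that leaves words unchanged, not the real Swedish stemmer, and with_an_ans applies nothing but the n/ns removal.

```python
# Hypothetical sketch of the word / stem1 / stem2 / * comparison.
# Both stemmers are stand-ins, not the Snowball Swedish stemmer.


def baseline(word):
    """Stand-in for the current stemmer (assumption): no change at all."""
    return word


def with_an_ans(word):
    """Stand-in for the stemmer with the extra rule: drop n or ns only."""
    if word.endswith("ans"):
        return word[:-2]
    if word.endswith("an"):
        return word[:-1]
    return word


def compare(words, stem1, stem2):
    """Return (word, stem1, stem2, mark) rows; mark is '*' when stems differ."""
    rows = []
    for w in words:
        s1, s2 = stem1(w), stem2(w)
        rows.append((w, s1, s2, "*" if s1 != s2 else ""))
    return rows


def render(rows):
    """Format the rows as fixed-width plain-text columns."""
    return "\n".join(
        f"{w:<12}{s1:<12}{s2:<12}{mark}".rstrip() for w, s1, s2, mark in rows
    )


if __name__ == "__main__":
    print(render(compare(["banan", "svans", "hus"], baseline, with_an_ans)))
```

The starred rows are the ones to go through by hand; with a real vocabulary the two stemmer functions would be replaced by the actual stemmer without and with the rule.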
karl
On 29 Jan 2008, at 17:54, Martin Porter wrote:
>
>
> Janko,
>
> Either -an, -ans was overlooked when the stemmer was written, or the
> effect of including -an, -ans was found to be generally damaging
> because of the additional mis-stemming it produced (Karl Wettin's
> example list...) At this distance from the original work I cannot say
> which is true.
>
> Presumably you have included -an, -ans in your version? If so, could
> you report back later on whether you encounter mis-stemming problems?
> We can continue from there.
>
> (Something you have to do in introducing a new rule is to compare the
> sample vocabulary stemmed with the extra rule and stemmed without the
> extra rule. You draw up 3 columns,
>
> word stem1 stem2 *
> ... ... ...
>
> and put '*' at the end when stem1 and stem2 differ. Then you go
> through by hand looking for the * indicator and monitoring the
> improvements and degradations. As a native speaker of Swedish, you
> would of course be much better placed than me to judge the result.)
>
> I hope this helps,
>
> Martin
>
>
>
>
> On Thu, 2008-01-17 at 18:34 +0100, Janko Luin wrote:
>> I have recently implemented an acts_as_ferret based search engine
>> on a Swedish site, and ran into the Swedish stemmer head-on. It's
>> mostly very good, but misses two common noun forms: '-an' and '-ans'.
>>
>
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss at lists.tartarus.org
> http://lists.tartarus.org/mailman/listinfo/snowball-discuss