[Snowball-discuss] Swedish stems need patching

Martin Porter martin.porter at grapeshot.co.uk
Tue Jan 29 16:54:34 GMT 2008



Janko,

Either -an, -ans was overlooked when the stemmer was written, or the
effect of including -an, -ans was found to be generally damaging because
of the additional mis-stemming it produced (see Karl Wettin's example
list). At this distance from the original work I cannot say which is
true.

Presumably you have included -an, -ans in your version? If so, could you
report back later on whether you encounter mis-stemming problems? We can
continue from there.

(Something you have to do in introducing a new rule is to compare the
sample vocabulary stemmed with the extra rule and stemmed without the
extra rule. You draw up three columns, plus a marker column,

  word       stem1     stem2   *
  ...        ...       ...     

and put '*' at the end when stem1 and stem2 differ. Then you go through
by hand looking for the * indicator and noting the improvements and
degradations. As a native speaker of Swedish, you would of course be
much better placed than I am to judge the result.)
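
The comparison described above could be sketched in Python roughly as
follows. This is only an illustration under stated assumptions: the two
stemming functions are hypothetical toy placeholders, not the Snowball
Swedish stemmer, and the five-word list stands in for the real sample
vocabulary. Here stem1 is taken to be the stem without the extra rule and
stem2 the stem with it; swap in the real stemmers to use it.

```python
def toy_stem_without(word):
    # Hypothetical placeholder for the existing stemmer:
    # strips only a final "-en" from words longer than four letters.
    if word.endswith("en") and len(word) > 4:
        return word[:-2]
    return word


def toy_stem_with(word):
    # Same placeholder plus the candidate rule: strip "-ans" or "-an"
    # when at least three letters would remain, else fall back.
    for suffix in ("ans", "an"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return toy_stem_without(word)


def compare_stems(words, stem_without, stem_with):
    # Build (word, stem1, stem2, marker) rows; marker is '*' when
    # the two stems differ and empty otherwise.
    rows = []
    for word in words:
        stem1 = stem_without(word)
        stem2 = stem_with(word)
        marker = "*" if stem1 != stem2 else ""
        rows.append((word, stem1, stem2, marker))
    return rows


def format_row(row):
    # Left-align the three text columns to 12 characters each and
    # drop trailing spaces on unmarked rows.
    return "{:<12}{:<12}{:<12}{}".format(*row).rstrip()


# Illustrative sample only; a real check needs the full vocabulary.
sample_words = ["flickan", "flickans", "organ", "huset", "vatten"]

rows = compare_stems(sample_words, toy_stem_without, toy_stem_with)
print(format_row(("word", "stem1", "stem2", "*")))
for row in rows:
    print(format_row(row))
```

Only the rows marked '*' need to be read by hand. With these toy rules,
"organ" becoming "org" is the sort of degradation to watch for alongside
the intended conflation of "flickan" and "flickans".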

I hope this helps,

Martin




On Thu, 2008-01-17 at 18:34 +0100, Janko Luin wrote:
> I have recently implemented an acts_as_ferret based search engine on a
> Swedish site, and ran into the Swedish stemmer head-on. It's mostly
> very good, but misses two common noun forms: '-an' and '-ans'. 
> 



