[Snowball-discuss] Swedish stems need patching
Janko Luin
janko at deltaprojects.se
Thu Jan 17 17:34:11 GMT 2008
I recently implemented an acts_as_ferret-based search engine on a
Swedish site and ran into the Swedish stemmer head-on. It's mostly
very good, but it misses two common noun forms: '-an' and '-ans'. Compare
with the example list:
klocka => klock
klockan => klockan
klockans => klockan
These should all be "klock".
In diff form:
Index: stem_ISO_8859_1.sbl
===================================================================
--- stem_ISO_8859_1.sbl (revision 500)
+++ stem_ISO_8859_1.sbl (working copy)
@@ -40,7 +40,7 @@
'a' 'arna' 'erna' 'heterna' 'orna' 'ad' 'e' 'ade' 'ande' 'arne'
'are' 'aste' 'en' 'anden' 'aren' 'heten' 'ern' 'ar' 'er' 'heter'
'or' 'as' 'arnas' 'ernas' 'ornas' 'es' 'ades' 'andes' 'ens' 'arens'
- 'hetens' 'erns' 'at' 'andet' 'het' 'ast'
+ 'hetens' 'erns' 'at' 'andet' 'het' 'ast' 'an' 'ans'
(delete)
's'
(s_ending delete)
Index: stem_MS_DOS_Latin_I.sbl
===================================================================
--- stem_MS_DOS_Latin_I.sbl (revision 500)
+++ stem_MS_DOS_Latin_I.sbl (working copy)
@@ -40,7 +40,7 @@
'a' 'arna' 'erna' 'heterna' 'orna' 'ad' 'e' 'ade' 'ande' 'arne'
'are' 'aste' 'en' 'anden' 'aren' 'heten' 'ern' 'ar' 'er' 'heter'
'or' 'as' 'arnas' 'ernas' 'ornas' 'es' 'ades' 'andes' 'ens' 'arens'
- 'hetens' 'erns' 'at' 'andet' 'het' 'ast'
+ 'hetens' 'erns' 'at' 'andet' 'het' 'ast' 'an' 'ans'
(delete)
's'
(s_ending delete)
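
To make the effect of the patch easier to check, here is a rough Python sketch of the main suffix step only. It is not the Snowball implementation: the names (r1_start, main_suffix, BASE_SUFFIXES, PATCHED_SUFFIXES) are made up for illustration, the R1 rule is my reading of the Swedish stemmer description, and the later stemmer steps are left out because they do not touch these three words.

```python
# Simplified sketch of the Swedish stemmer's main suffix step (assumption:
# later steps such as consonant-pair handling are omitted on purpose).
VOWELS = set("aeiouyäåö")
S_ENDINGS = set("bcdfghjklmnoprtvy")  # letters that allow a final 's' to be removed

BASE_SUFFIXES = [
    "a", "arna", "erna", "heterna", "orna", "ad", "e", "ade", "ande", "arne",
    "are", "aste", "en", "anden", "aren", "heten", "ern", "ar", "er", "heter",
    "or", "as", "arnas", "ernas", "ornas", "es", "ades", "andes", "ens", "arens",
    "hetens", "erns", "at", "andet", "het", "ast",
]
# The proposed patch: the same list plus 'an' and 'ans'.
PATCHED_SUFFIXES = BASE_SUFFIXES + ["an", "ans"]


def r1_start(word):
    """Return the index where R1 begins: just after the first non-vowel that
    follows a vowel, but leaving at least 3 letters in front of it."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)  # no R1 region at all


def main_suffix(word, suffixes):
    """Delete the longest listed suffix that lies entirely inside R1."""
    start = r1_start(word)
    best = ""
    for suffix in suffixes + ["s"]:
        fits_in_r1 = len(word) - len(suffix) >= start
        if word.endswith(suffix) and fits_in_r1 and len(suffix) > len(best):
            best = suffix
    if not best:
        return word
    if best == "s":
        # A bare 's' is only removed after a valid s-ending letter.
        if len(word) >= 2 and word[-2] in S_ENDINGS:
            return word[:-1]
        return word
    return word[: -len(best)]


if __name__ == "__main__":
    for w in ("klocka", "klockan", "klockans"):
        print(w, "=>", main_suffix(w, BASE_SUFFIXES), "/", main_suffix(w, PATCHED_SUFFIXES))
```

With the base list this gives the three stems listed at the top of this mail, and with the patched list all three words come out as "klock". The sketch says nothing about how the two extra suffixes affect the rest of the vocabulary; that would need a run against the full sample word list.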
Kind regards,
Janko Luin
___________________________________________________________________________
The Delta Projects, Janko Luin, developer, janko at deltaprojects.se
phone: +46 (0)8-667 76 90, mobile: +46 (0)739-78 29 27