[Snowball-discuss] Norwegian stemmer charset variants not in step

Olly Betts olly at survex.com
Mon Sep 11 13:07:09 BST 2006


The two character set variants of the Norwegian stemmer differ in more
than their character codes.  I believe the ISO-8859-1 version is more up to date:

--- norwegian/stem_ISO_8859_1.sbl	2006-09-11 13:00:37.000000000 +0100
+++ norwegian/stem_MS_DOS_Latin_I.sbl	2006-09-11 13:00:37.000000000 +0100
@@ -13,15 +13,15 @@
 
 stringescapes {}
 
-/* special characters (in ISO Latin I) */
+/* special characters (in MS-DOS Latin I) */
 
-stringdef ae   hex 'E6'
-stringdef ao   hex 'E5'
-stringdef o/   hex 'F8'
+stringdef ae   hex '91'
+stringdef ao   hex '86'
+stringdef o/   hex '9B'
 
 define v 'aeiouy{ae}{ao}{o/}'
 
-define s_ending  'bcdfghjlmnoprtvyz'
+define s_ending  'bcdfghjklmnoprtvyz'
 
 define mark_regions as (
 
@@ -43,7 +43,7 @@
             'hetens' 'ers' 'ets' 'et' 'het' 'ast'
                 (delete)
             's'
-                (s_ending or ('k' non-v) delete)
+                (s_ending delete)
             'erte' 'ert'
                 (<-'er')
         )
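
To make the behavioural difference concrete, here is a rough Python sketch
of just the 's' condition in each variant. It is not the stemmer: it ignores
the R1 region test and the longest-match behaviour of the surrounding among,
and it reads `'k' non-v` as "a 'k' directly before the 's', with a non-vowel
before the 'k'", which assumes the rule runs in backward mode. The function
names and sample strings are made up for illustration and are not claimed to
be Norwegian vocabulary.

```python
# Vowels as in `define v` from the diff, with the three special letters spelled out.
VOWELS = set("aeiouyæåø")

# s_ending exactly as listed in each variant of the diff.
S_ENDING_ISO = set("bcdfghjlmnoprtvyz")   # ISO-8859-1 variant: no 'k'
S_ENDING_DOS = set("bcdfghjklmnoprtvyz")  # MS-DOS Latin I variant: includes 'k'


def s_removed_iso(word: str) -> bool:
    """ISO-8859-1 rule: s_ending or ('k' non-v)."""
    if len(word) < 2 or not word.endswith("s"):
        return False
    prev = word[-2]
    if prev in S_ENDING_ISO:
        return True
    # 'k' is only accepted when the letter before it exists and is not a vowel.
    return prev == "k" and len(word) >= 3 and word[-3] not in VOWELS


def s_removed_dos(word: str) -> bool:
    """MS-DOS Latin I rule: s_ending only, where s_ending contains 'k'."""
    if len(word) < 2 or not word.endswith("s"):
        return False
    return word[-2] in S_ENDING_DOS


if __name__ == "__main__":
    for sample in ("banks", "taks", "hus"):
        print(sample, s_removed_iso(sample), s_removed_dos(sample))
```

Under these assumptions the two variants only disagree when the 's' follows
a 'k' that itself follows a vowel (or starts the word): the MS-DOS Latin I
variant removes the 's' there and the ISO-8859-1 variant does not.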

 
The other stemmers are consistent between character set variants,
except for a differently indented closing round bracket in the
Swedish stemmer.
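
For reference, the one difference that is expected between variants is the
set of stringdef hex codes. A quick Python check that the two sets in the
Norwegian diff name the same three letters is sketched below. It assumes
"MS-DOS Latin I" corresponds to code page 850 (Python's `cp850` codec); the
dictionary and function names are illustrative only.

```python
# Hex codes copied from the stringdef lines in the diff.
ISO_CODES = {"ae": 0xE6, "ao": 0xE5, "o/": 0xF8}
DOS_CODES = {"ae": 0x91, "ao": 0x86, "o/": 0x9B}


def decode_codes(codes: dict, encoding: str) -> dict:
    """Decode each single-byte code to its character in the given encoding."""
    return {name: bytes([value]).decode(encoding) for name, value in codes.items()}


iso_letters = decode_codes(ISO_CODES, "latin-1")  # ISO-8859-1
dos_letters = decode_codes(DOS_CODES, "cp850")    # assumed MS-DOS Latin I

if __name__ == "__main__":
    print(iso_letters)
    print(dos_letters)
    print("same letters:", iso_letters == dos_letters)
```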

Cheers,
    Olly


