[Snowball-discuss] Norwegian stemmer charset variants not in step
Olly Betts
olly at survex.com
Mon Sep 11 13:07:09 BST 2006
The two character set variants of the Norwegian stemmer differ in more
than just the character encoding. I believe the ISO-8859-1 version is
the more up to date of the two:
--- norwegian/stem_ISO_8859_1.sbl 2006-09-11 13:00:37.000000000 +0100
+++ norwegian/stem_MS_DOS_Latin_I.sbl 2006-09-11 13:00:37.000000000 +0100
@@ -13,15 +13,15 @@
stringescapes {}
-/* special characters (in ISO Latin I) */
+/* special characters (in MS-DOS Latin I) */
-stringdef ae hex 'E6'
-stringdef ao hex 'E5'
-stringdef o/ hex 'F8'
+stringdef ae hex '91'
+stringdef ao hex '86'
+stringdef o/ hex '9B'
define v 'aeiouy{ae}{ao}{o/}'
-define s_ending 'bcdfghjlmnoprtvyz'
+define s_ending 'bcdfghjklmnoprtvyz'
define mark_regions as (
@@ -43,7 +43,7 @@
'hetens' 'ers' 'ets' 'et' 'het' 'ast'
(delete)
's'
- (s_ending or ('k' non-v) delete)
+ (s_ending delete)
'erte' 'ert'
(<-'er')
)
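
To make the effect of the second hunk concrete, here is a rough Python
sketch of just the 's' rule in each variant. It is a simplification and
not the Snowball implementation: it ignores the R1 region limit and the
longest-match selection among the other suffixes, and the function names
are made up for illustration.

```python
# Simplified model of the 's' suffix rule only (assumption: R1 limits and
# the other suffixes in the among() are ignored).
VOWELS = set("aeiouyæåø")

S_ENDING_ISO = set("bcdfghjlmnoprtvyz")   # ISO-8859-1 variant: no 'k'
S_ENDING_DOS = set("bcdfghjklmnoprtvyz")  # MS-DOS Latin I variant: has 'k'


def strip_s_iso(word):
    """Model of: 's' (s_ending or ('k' non-v) delete)."""
    if len(word) < 2 or not word.endswith("s"):
        return word
    stem = word[:-1]
    if stem[-1] in S_ENDING_ISO:
        return stem
    # 'k' only licenses deletion when the character before it is a non-vowel.
    if stem[-1] == "k" and len(stem) >= 2 and stem[-2] not in VOWELS:
        return stem
    return word


def strip_s_dos(word):
    """Model of: 's' (s_ending delete), with 'k' inside s_ending."""
    if len(word) < 2 or not word.endswith("s"):
        return word
    stem = word[:-1]
    return stem if stem[-1] in S_ENDING_DOS else word
```

Under this model the two variants agree on a word like "verks" (both
remove the 's', since the 'k' follows a consonant) but disagree on
"laks": the ISO-8859-1 rule leaves it alone because the 'k' follows a
vowel, while the MS-DOS Latin I rule strips it to "lak".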
The other stemmers are consistent between character set variants,
except for a differently indented closing round bracket in the
Swedish stemmer.
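
A check like this could be automated. The following is a minimal Python
sketch, not a tool that exists in the Snowball tree: it drops the lines
that are expected to differ (the stringdef lines and the "special
characters" comment) and reports whatever else is different. The helper
names are hypothetical.

```python
import difflib


def normalise(source):
    """Return the lines of a .sbl source minus the charset-specific ones."""
    kept = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("stringdef "):
            continue  # character codes legitimately differ per charset
        if stripped.startswith("/* special characters"):
            continue  # comment naming the charset
        kept.append(line.rstrip())
    return kept


def unexpected_differences(source_a, source_b):
    """List lines present in only one variant after normalisation."""
    diff = difflib.ndiff(normalise(source_a), normalise(source_b))
    # ndiff prefixes removed lines with "- " and added lines with "+ ".
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

To use it on real files, read each variant with something like
open(path, encoding="latin-1").read() so that any byte value decodes
without error, then pass the two strings in. Leading whitespace is kept,
so an indentation-only difference such as the Swedish one would still
be reported.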
Cheers,
Olly