[Snowball-discuss] Problem with PySnowballStemmer

Patrick Mézard pmezard at gmail.com
Sat Jan 21 18:09:50 GMT 2006


Hello,
First, thank you, Weongyo Jeong, for providing updated Python bindings; 
I was definitely looking for them.

However, I cannot get them to work with UTF-8 input:
"""
# -*- coding: iso-8859-1 -*-
import SnowballStemmer

encodings = [
     ('UTF_8', 'utf8'),
     ('ISO_8859_1', 'iso-8859-1'),
]

for sn_enc, py_enc in encodings:
     s = SnowballStemmer.SnowballStemmer().new('french', sn_enc)
     #This is a 'latin small letter e acute' at the end of the word.
     u = unicode('pitié', 'iso-8859-1').encode(py_enc)
     print sn_enc, ':', repr(u), '=>', repr(s.stem_str(u))
"""

outputs:
"""
UTF_8 : 'piti\xc3\xa9' => 'piti\xc3'
ISO_8859_1 : 'piti\xe9' => pit
"""
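
As a quick sanity check on the first result, here is a minimal sketch 
in Python 3 syntax, using only built-in functions. The byte strings are 
copied from the output above; "is_valid_utf8" is a hypothetical helper 
written for this illustration and is not part of the bindings. It only 
shows that the returned bytes do not decode as UTF-8 and that they equal 
the input minus its last byte. It says nothing about why the stemmer 
produced them.

```python
# Python 3 sketch; byte strings copied from the reported output.
utf8_input = b'piti\xc3\xa9'   # 'pitié' encoded as UTF-8
utf8_stem = b'piti\xc3'        # what the UTF_8 stemmer returned


def is_valid_utf8(data):
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(utf8_input))     # True
print(is_valid_utf8(utf8_stem))      # False: 0xC3 is a lead byte with no continuation byte
print(utf8_stem == utf8_input[:-1])  # True: the result is the input minus its last byte
```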

The UTF-8 version returns an invalid UTF-8 sequence. I am completely new 
to Snowball and have only just seen the announcement that Unicode 
support was added last year. So far I have failed to find reliable 
information about how this is done, even when looking in the code:

1- There is a bunch of stemming files in the bindings' sources, including 
"stem_UTF_8_french.c". I suppose it was generated from a Snowball 
stemming file. Does "UTF_8" mean that the input strings are UTF-8 byte 
sequences? I suppose so.

2- Reading the mailing list, I thought UTF-8 was implemented by 
translating inputs to UCS-2 first and then stemming them. I cannot find 
anything that looks like a UTF-8 decoder/encoder. Besides, "symbol" is 
defined as an "unsigned char". Are the bindings interpreting UTF-8 
strings directly?

3- If the answer to [2] is yes, then AFAIK UTF-8 is nothing more than an 
encoding layer on top of Unicode code points. How does the stemmer 
handle normalization forms? Are there any expectations about them? I 
tried sending the same UTF-8 word in NFD form instead of the default 
Python one (which should be NFC or NFKC), but it changed nothing.
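
To make questions 2 and 3 concrete, here is a small sketch in Python 3 
syntax, using only the standard "unicodedata" module. It shows the test 
word in NFC and NFD form and the bytes a byte-oriented ("unsigned char") 
stemmer would receive in each case. It is only an illustration of the 
input; it assumes nothing about how the generated stemmers process it.

```python
import unicodedata

word = 'piti\u00e9'  # 'pitié' with a precomposed e acute (U+00E9)

# NFC keeps the precomposed letter; NFD splits it into 'e' + U+0301.
nfc = unicodedata.normalize('NFC', word)
nfd = unicodedata.normalize('NFD', word)

nfc_bytes = nfc.encode('utf-8')
nfd_bytes = nfd.encode('utf-8')

# Code points, then bytes, then the raw byte string.
print(len(nfc), len(nfc_bytes), repr(nfc_bytes))  # 5 6 b'piti\xc3\xa9'
print(len(nfd), len(nfd_bytes), repr(nfd_bytes))  # 6 7 b'pitie\xcc\x81'
```

In both forms the accented letter takes more than one byte, so the 
number of "symbols" differs from the number of characters.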

The bindings were compiled and tested with:
"""
ActivePython 2.4.2 Build 248 (ActiveState Corp.) based on
Python 2.4.2 (#67, Oct 30 2005, 16:11:18) [MSC v.1310 32 bit (Intel)] on 
win32
"""

Did I miss something obvious?
Thank you for any ideas about this.

--
Patrick Mézard



