[Snowball-discuss] Problem with PySnowballStemmer
Patrick Mézard
pmezard at gmail.com
Sat Jan 21 18:09:50 GMT 2006
Hello,
First, thank you, Weongyo Jeong, for providing updated Python bindings;
I was definitely looking for them.
However, I cannot get them to work with UTF-8 input:
"""
# -*- coding: iso-8859-1 -*-
import SnowballStemmer

# (Snowball encoding name, Python codec name)
encodings = [
    ('UTF_8', 'utf8'),
    ('ISO_8859_1', 'iso-8859-1'),
]

for sn_enc, py_enc in encodings:
    s = SnowballStemmer.SnowballStemmer().new('french', sn_enc)
    # This is a 'latin small letter e acute' at the end of the word.
    u = unicode('pitié', 'iso-8859-1').encode(py_enc)
    print sn_enc, ':', repr(u), '=>', repr(s.stem_str(u))
"""
outputs:
"""
UTF_8 : 'piti\xc3\xa9' => 'piti\xc3'
ISO_8859_1 : 'piti\xe9' => pit
"""
The UTF-8 version returns an invalid UTF-8 sequence. I am completely new
to Snowball and have only just seen the announcement that Unicode
support was added last year. So far I have failed to find reliable
information about how this is done, even when looking in the code:
1- There are a bunch of stemming files in the bindings sources, including
"stem_UTF_8_french.c". I suppose it was generated from a Snowball
stemming file. Does "UTF_8" mean that the input strings are UTF-8 byte
sequences? I suppose so.
2- Reading the ML, I thought UTF-8 was implemented by translating inputs
to UCS-2 first and then stemming them. I cannot find anything that looks
like a UTF-8 decoder/encoder. Besides, "symbol" is defined as an "unsigned
char". Are the bindings interpreting UTF-8 strings directly?
3- If the answer to [2] is yes: AFAIK UTF-8 is nothing more than an
encoding layer on top of Unicode code points. How does the stemmer handle
normalization forms? Does it expect a particular one? I tried sending the
same UTF-8 word in NFD form instead of the default Python one (which should
be NFC or NFKC), but it changed nothing.
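For reference, the two byte-level observations above can be reproduced without the bindings. This is only a minimal sketch, written for a current Python 3 interpreter rather than the Python 2.4 used for the test; the helper name is_valid_utf8 is made up for illustration. It inspects the byte strings only and says nothing about how Snowball itself processes them.

```python
import unicodedata


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# Byte strings copied from the UTF_8 line of the output above.
word_utf8 = b'piti\xc3\xa9'   # input: 'pitié' encoded as UTF-8
stem_utf8 = b'piti\xc3'       # value returned by stem_str()

print(is_valid_utf8(word_utf8))   # True
print(is_valid_utf8(stem_utf8))   # False: 0xC3 starts a two-byte sequence that is cut short

# The same word in the two normalization forms mentioned in [3].
word = 'piti\u00e9'           # precomposed 'e acute' (U+00E9)
nfc_bytes = unicodedata.normalize('NFC', word).encode('utf-8')
nfd_bytes = unicodedata.normalize('NFD', word).encode('utf-8')

print(repr(nfc_bytes))   # b'piti\xc3\xa9'
print(repr(nfd_bytes))   # b'pitie\xcc\x81' (plain 'e' followed by U+0301)
```

In NFD the word ends with a plain 'e' followed by a two-byte combining accent, whereas in NFC it ends with the two-byte precomposed character, so code that works on the raw bytes receives different input in the two cases.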
The bindings were compiled and tested with:
"""
ActivePython 2.4.2 Build 248 (ActiveState Corp.) based on
Python 2.4.2 (#67, Oct 30 2005, 16:11:18) [MSC v.1310 32 bit (Intel)] on
win32
"""
Did I miss something obvious?
Thank you for any ideas about this.
--
Patrick Mézard