[Snowball-discuss] Unicode and python bindings

Patrick Mézard pmezard at gmail.com
Tue May 16 13:39:05 BST 2006


Hello,

While trying to solve the issues I raised in a previous post 
(<http://thread.gmane.org/gmane.comp.search.snowball/772/focus=772>), I 
ended up rewriting parts of Weongyo Jeong's original Python bindings to 
fit my needs. The main change is that the module interface now consumes 
Python Unicode strings (UTF-16) instead of native strings. The idea is 
that code dealing with multiple languages usually converts document 
encodings to Unicode first, before passing the text to other modules, 
including stemming. With the original bindings, since I could not get 
the UTF-8 interface to work, I had to convert back from Unicode to 
specific encodings, which was at best a pain and at worst impossible.
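
To illustrate the workflow described above, here is a minimal sketch 
written for current Python. It does not use the actual bindings: 
to_unicode, stem_all and toy_stem are hypothetical names, and toy_stem 
is a placeholder that only strips a trailing "s", not a Snowball 
algorithm. It only shows documents in different encodings being decoded 
to Unicode once, so the stemming step never has to know the original 
encoding.

```python
# Hypothetical sketch: none of these names come from the pysnowball bindings.

def to_unicode(raw, encoding):
    """Decode a byte string in a known encoding into a Unicode string."""
    return raw.decode(encoding)


def toy_stem(word):
    """Placeholder stemmer (NOT Snowball): strip a single trailing 's'."""
    return word[:-1] if word.endswith("s") else word


def stem_all(words, stem):
    """Apply a stemmer callable, assumed to accept Unicode, to each word."""
    return [stem(word) for word in words]


# The same French word stored in two different encodings.
latin1_doc = b"caf\xe9s"      # ISO-8859-1 bytes
utf8_doc = b"caf\xc3\xa9s"    # UTF-8 bytes

# Unify the encodings first, then stem without caring where the text came from.
words = [to_unicode(latin1_doc, "iso-8859-1"), to_unicode(utf8_doc, "utf-8")]
stems = stem_all(words, toy_stem)
```

Once decoded, both inputs are the same Unicode string, so a 
Unicode-aware stemmer can be swapped in for toy_stem without touching 
the decoding code.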

The new version is temporarily available here: 
<http://perso.wanadoo.fr/patrick.mezard/dev/pysnowball-0.0.2.zip> and I 
can provide a copy of the darcs (<http://abridgegame.org/darcs/>) 
repository I used to develop my branch.

I think it still needs to be reviewed before any release (I am far from 
being a Python C extension expert), even though it passes the few tests 
I could think of.

What's your opinion about this?

--
Patrick Mézard



