[Snowball-discuss] Unicode and python bindings
Patrick Mézard
pmezard at gmail.com
Tue May 16 13:39:05 BST 2006
Hello,
While trying to solve the issues I raised in a previous post
(<http://thread.gmane.org/gmane.comp.search.snowball/772/focus=772>), I
ended up rewriting parts of Weongyo Jeong's original Python bindings to
fit my needs. The main change is that the module interface now consumes
Python Unicode strings (UTF-16) instead of native strings. The idea is
that code dealing with multiple languages usually first converts the
documents' encodings to Unicode before passing them to other modules,
including stemming. With the original bindings, since I failed to use
the UTF-8 interface, I had to convert back from Unicode to specific
encodings, which was at best a pain and at worst impossible.
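
To illustrate the difference between the two interfaces, here is a minimal
sketch. It is not the code of the bindings: stem_bytes is a hypothetical
stand-in for a byte-oriented stemmer (it only strips a trailing "s"), the
function names are made up for this example, and it is written for
present-day Python 3, where str is already Unicode. It shows why converting
back to one specific encoding can fail, while a Unicode-in, Unicode-out
wrapper does not.

```python
def stem_bytes(word: bytes, encoding: str) -> bytes:
    """Hypothetical byte-oriented stemmer (NOT the real Snowball API).

    It only strips a trailing "s", which is enough to show the data flow.
    """
    text = word.decode(encoding)
    if text.endswith("s"):
        text = text[:-1]
    return text.encode(encoding)


def stem_via_specific_encoding(word: str, encoding: str) -> str:
    """Byte-oriented style: the caller must pick an encoding per document.

    Raises UnicodeEncodeError if the word has a character that the chosen
    encoding cannot represent.
    """
    return stem_bytes(word.encode(encoding), encoding).decode(encoding)


def stem_unicode(word: str) -> str:
    """Unicode-oriented style: Unicode in, Unicode out.

    UTF-8 is used internally here because it can represent any code point;
    the caller never has to choose an encoding.
    """
    return stem_bytes(word.encode("utf-8"), "utf-8").decode("utf-8")


if __name__ == "__main__":
    print(stem_unicode("maisons"))
    try:
        # U+0153 (the "oe" ligature) has no representation in Latin-1.
        stem_via_specific_encoding("\u0153ufs", "latin-1")
    except UnicodeEncodeError as exc:
        print("cannot convert back:", exc.reason)
```

The wrapper keeps all encoding decisions inside the stemming module, which
is the property the Unicode interface is meant to provide to callers.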
The new version is temporarily available here:
<http://perso.wanadoo.fr/patrick.mezard/dev/pysnowball-0.0.2.zip> and I
can provide a copy of the darcs (<http://abridgegame.org/darcs/>)
repository I used to rewrite my branch.
I think it still needs to be reviewed before any release (I am far from
being a Python C extension expert), even though it passes the few tests I
could think of.
What's your opinion about this?
--
Patrick Mézard