[Snowball-discuss] Haskell bindings, and issues with UTF_8

dag.odenhall at gmail.com dag.odenhall at gmail.com
Sat Dec 22 19:17:22 GMT 2012


Hello list

I have created a new package for Haskell bindings to the Snowball library:

http://hackage.haskell.org/package/snowball

There was already an older package, but it has some problems that my
package fixes. In particular, mine doesn't require your system locale to be
UTF-8, it is thread safe, and it doesn't leak memory.

In the process, I also discovered that the UTF_8 versions of the stemmers
in libstemmer_c appear to be broken. When I tested my bindings against the
test files in the Snowball distribution (the diffs.txt files), the tests
failed early, basically as soon as they encountered any non-ASCII character
(in Hungarian, which was the first language tested). I also had a problem
with my app, which uses the bindings, crashing when it encountered
non-ASCII characters in Swedish words (although it was actually using the
English stemmer).

Investigating this, I discovered that libstemmer seemed to return invalid
UTF-8 for words like "Malmö". The last character there is two bytes, but
libstemmer returns only the first byte. This problem had gone unnoticed
with the older bindings because they ignored errors in the decoding process.
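
To make the symptom concrete, here is a minimal C sketch of the kind of
check that exposes it. It is not part of libstemmer or of my bindings;
is_valid_utf8 is a hypothetical helper written only for illustration, and
it checks lead/continuation byte structure only (no overlong or surrogate
checks). "Malmö" in UTF-8 is the bytes 4D 61 6C 6D C3 B6, so a result cut
off after C3 ends in a lead byte with no continuation byte.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) has well-formed UTF-8
 * byte structure, 0 otherwise. Simplified: it does not reject overlong
 * encodings or surrogate code points. */
static int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need; /* number of continuation bytes expected */
        size_t k;

        if (c < 0x80)
            need = 0;                 /* ASCII */
        else if ((c & 0xE0) == 0xC0)
            need = 1;                 /* 2-byte sequence */
        else if ((c & 0xF0) == 0xE0)
            need = 2;                 /* 3-byte sequence */
        else if ((c & 0xF8) == 0xF0)
            need = 3;                 /* 4-byte sequence */
        else
            return 0;                 /* stray continuation or bad lead */

        if (need > len - i - 1)
            return 0;                 /* sequence truncated at end of buffer */

        for (k = 1; k <= need; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;             /* expected a 10xxxxxx byte */

        i += need + 1;
    }
    return 1;
}
```

With this helper, the six bytes of "Malm\xC3\xB6" are accepted, while the
five bytes "Malm\xC3" (what I appeared to get back) are rejected because
the final lead byte has nothing following it.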

I had a hunch the problem was with the UTF_8 encoding specifically, so I
changed my bindings to use the other encodings where available. After that
change, all tests except for "porter" pass! The problems with porter
seem unrelated to encodings; maybe the test files are simply wrong. The
only stemmer that still uses UTF_8 is turkish, which has no test files in
the Snowball distribution, so I can't easily verify whether it works
correctly.

I had also experienced some weirdness when benchmarking my bindings, where
reusing a stemmer instance became slower than not reusing one when stemming
thousands of words (though it was faster for a few hundred). This problem
also went away when I stopped using UTF_8.

Cheerio, and thanks for a great library!

Dag