[Snowball-discuss] Hungarian characters in hungarian/stop.txt

Olly Betts olly at survex.com
Wed Jun 11 01:42:47 BST 2014


On Tue, Jun 10, 2014 at 05:48:08PM -0400, Tom Lane wrote:
> A comparison of algorithms/hungarian/stem_ISO_8859_1.sbl's list of
> "LATIN I" characters to Wikipedia suggests that its identification
> of u-double-acute as U+00FB is also mistaken: that character is
> really LATIN2 0xFB which maps to Unicode U+0171.  The other characters
> called out in the list have the same codes in LATIN1 and LATIN2, which
> may account for why the bug hasn't been noticed long since.
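
For reference, the mismatch described above can be checked with a short Python sketch. It only assumes that Python's standard `iso8859_1` and `iso8859_2` codecs follow the published code charts; the byte list for the "other" accented vowels (á, é, í, ó, ö, ú, ü) is my own illustration, not copied from the .sbl file.

```python
# Decode byte 0xFB under both encodings to see which code point each yields.
byte = bytes([0xFB])
as_latin1 = byte.decode("iso8859_1")  # LATIN1 interpretation
as_latin2 = byte.decode("iso8859_2")  # LATIN2 interpretation

print(f"LATIN1 0xFB -> U+{ord(as_latin1):04X}")
print(f"LATIN2 0xFB -> U+{ord(as_latin2):04X}")

# Illustrative list (an assumption, not taken from the .sbl source):
# byte values of a-acute, e-acute, i-acute, o-acute, o-diaeresis,
# u-acute and u-diaeresis.
other_vowels = (0xE1, 0xE9, 0xED, 0xF3, 0xF6, 0xFA, 0xFC)

# Keep the bytes that decode to the same character in both encodings.
shared = [
    b for b in other_vowels
    if bytes([b]).decode("iso8859_1") == bytes([b]).decode("iso8859_2")
]
print(f"{len(shared)} of {len(other_vowels)} bytes agree in both encodings")
```

Only 0xFB differs here, which fits the explanation that the shared codes hid the problem.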

I've submitted a fix for the algorithm here:

https://github.com/snowballstem/snowball/pull/4

(The Travis failure is bogus: Richard hasn't finished setting it up.)

I had a quick try to fix up the test vocabulary correspondingly, but
my first attempt didn't work.  I'll have a look again when I've more
time, if nobody else sorts it out first.

Cheers,
    Olly


