[Snowball-discuss] Hungarian characters in hungarian/stop.txt
Tom Lane
tgl at sss.pgh.pa.us
Wed Jun 11 01:57:40 BST 2014
Olly Betts <olly at survex.com> writes:
> On Tue, Jun 10, 2014 at 05:48:08PM -0400, Tom Lane wrote:
>> A comparison of algorithms/hungarian/stem_ISO_8859_1.sbl's list of
>> "LATIN I" characters to Wikipedia suggests that its identification
>> of u-double-acute as U+00FB is also mistaken: that character is
>> really LATIN2 0xFB which maps to Unicode U+0171. The other characters
>> called out in the list have the same codes in LATIN1 and LATIN2, which
>> may account for why the bug hasn't been noticed long since.
> I've submitted a fix for the algorithm here:
> https://github.com/snowballstem/snowball/pull/4
Thanks for the quick response! But I think you need this in
the new hungarian/stem_Unicode.sbl file:
-stringdef uq hex 'FB' //u-double acute
+stringdef uq hex '171' //u-double acute
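
For illustration only (not part of the Snowball sources): a minimal sketch, assuming Python 3 and its standard codecs, of why hex 'FB' only names u-double-acute when the byte is read as LATIN2. The helper names codepoint and differing_letters are hypothetical.

```python
# Illustrative check only; nothing here comes from the Snowball tree.
BYTE = bytes([0xFB])

latin1_char = BYTE.decode("latin-1")    # ISO 8859-1 interpretation of 0xFB
latin2_char = BYTE.decode("iso8859_2")  # ISO 8859-2 interpretation of 0xFB


def codepoint(ch: str) -> str:
    """Format a single character as a U+XXXX code point string."""
    return f"U+{ord(ch):04X}"


def differing_letters(letters: str = "áéíóöőúüű") -> list:
    """Return the letters whose LATIN2 byte decodes to a different
    character when that same byte is read as LATIN1."""
    result = []
    for ch in letters:
        raw = ch.encode("iso8859_2")
        if raw.decode("latin-1") != ch:
            result.append(ch)
    return result


if __name__ == "__main__":
    print("0xFB as LATIN1:", codepoint(latin1_char))
    print("0xFB as LATIN2:", codepoint(latin2_char))
    print("letters that differ:", differing_letters())
```

Under these assumptions, only the two double-acute letters come back from differing_letters, which fits the observation that the other accented characters share codes between LATIN1 and LATIN2.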
regards, tom lane