[Snowball-discuss] Hungarian characters in hungarian/stop.txt

Tom Lane tgl at sss.pgh.pa.us
Wed Jun 11 01:57:40 BST 2014


Olly Betts <olly at survex.com> writes:
> On Tue, Jun 10, 2014 at 05:48:08PM -0400, Tom Lane wrote:
>> A comparison of algorithms/hungarian/stem_ISO_8859_1.sbl's list of
>> "LATIN I" characters to Wikipedia suggests that its identification
>> of u-double-acute as U+00FB is also mistaken: that character is
>> really LATIN2 0xFB which maps to Unicode U+0171.  The other characters
>> called out in the list have the same codes in LATIN1 and LATIN2, which
>> may account for why the bug hasn't been noticed long since.

> I've submitted a fix for the algorithm here:
> https://github.com/snowballstem/snowball/pull/4

Thanks for the quick response!  But I think you need this in
the new hungarian/stem_Unicode.sbl file:

-stringdef uq  hex 'FB'  //u-double acute
+stringdef uq  hex '171' //u-double acute
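
For anyone who wants to check the mapping, a quick sketch in Python (not part of the Snowball sources; the variable names are illustrative only) decodes byte 0xFB under both encodings. It should show that the byte is U+00FB (u-circumflex) in LATIN1 but U+0171 (u-double-acute) in LATIN2, which is why the Unicode file needs hex '171':

```python
# Decode the single byte 0xFB under the two encodings discussed above.
raw = bytes([0xFB])

latin1_char = raw.decode("latin-1")    # ISO 8859-1 (LATIN1)
latin2_char = raw.decode("iso8859_2")  # ISO 8859-2 (LATIN2)

# Format the resulting code points in the usual U+XXXX notation.
latin1_codepoint = f"U+{ord(latin1_char):04X}"
latin2_codepoint = f"U+{ord(latin2_char):04X}"

print("LATIN1 0xFB ->", latin1_codepoint)
print("LATIN2 0xFB ->", latin2_codepoint)
```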

			regards, tom lane


