[Snowball-discuss] Hungarian characters in hungarian/stop.txt

Tue Jun 10 22:48:08 BST 2014

I wrote:
> The Postgres project received a complaint about misspellings in our copy
> of the Snowball Hungarian stop-word list:
> http://www.postgresql.org/message-id/20140610081936.2599.96998@wrigleys.postgresql.org
> basically to the effect that in place of ő (U+0151) we have õ (U+00E5).

Ah, sorry, I fat-fingered a mental octal-to-hex conversion.  It looks
like what we have is U+00F5 in place of U+0151, which means that the root
of the confusion here appears to be LATIN1 vs LATIN2.  U+00F5 is of course
also code 0xF5 in ISO-8859-1, while o-double-acute (U+0151) turns out to
have code 0xF5 in ISO-8859-2.  So I think somebody submitted a file in
LATIN2 once upon a time and it was misinterpreted as LATIN1.
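
The byte-level confusion described above can be checked with a minimal
Python sketch.  It is not part of the Snowball or Postgres sources; the
helper name repair_latin2_read_as_latin1 and the sample word are
assumptions made purely for illustration.

```python
# One byte, 0xF5, read under the two encodings discussed above.
raw = bytes([0xF5])
as_latin1 = raw.decode("iso-8859-1")   # U+00F5, o with tilde
as_latin2 = raw.decode("iso-8859-2")   # U+0151, o with double acute


def repair_latin2_read_as_latin1(text):
    """Undo a LATIN2 file having been decoded as LATIN1.

    Hypothetical helper: it assumes every character in `text` came from
    a single byte that was really LATIN2, so it is only safe on text
    known to have been damaged in exactly that way.
    """
    return text.encode("iso-8859-1").decode("iso-8859-2")


# Illustrative sample word containing the wrong character U+00F5.
sample = "el\u00f5tt"
fixed = repair_latin2_read_as_latin1(sample)

print(hex(ord(as_latin1)), hex(ord(as_latin2)), fixed == "el\u0151tt")
# prints: 0xf5 0x151 True
```

The round trip works only because LATIN1 maps every byte to the code
point of the same value, so re-encoding as LATIN1 recovers the original
bytes exactly.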

However, this problem doesn't stop with the stopword file.  AFAICS, the
Snowball source code for the Hungarian stemmer believes that (1) LATIN1
is a suitable character set for it to work in, and (2) o-double-acute is
0xF5 in LATIN1.  At this point I find both things dubious.

Comparing the list of "LATIN I" characters in
algorithms/hungarian/stem_ISO_8859_1.sbl against Wikipedia suggests that
its identification of u-double-acute as U+00FB is also mistaken: that
character is really LATIN2 0xFB, which maps to Unicode U+0171.  The other
characters called out in the list have the same codes in LATIN1 and
LATIN2, which may explain why the bug went unnoticed for so long.
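
A rough way to confirm which Hungarian accented letters share a code
between the two character sets is sketched below in Python.  The letter
list and the function name split_by_latin1_compat are assumptions for
illustration; they are not taken from the .sbl file.

```python
# Lowercase Hungarian accented vowels, written as escapes:
# a-acute, e-acute, i-acute, o-acute, o-diaeresis, o-double-acute,
# u-acute, u-diaeresis, u-double-acute.
HUNGARIAN_ACCENTED = (
    "\u00e1\u00e9\u00ed\u00f3\u00f6\u0151\u00fa\u00fc\u0171"
)


def split_by_latin1_compat(chars):
    """Split chars into those with the same byte in LATIN1 and LATIN2
    and those without (including letters LATIN1 cannot encode at all)."""
    same, differ = [], []
    for ch in chars:
        latin2 = ch.encode("iso-8859-2")
        try:
            latin1 = ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            latin1 = None  # character does not exist in LATIN1
        (same if latin1 == latin2 else differ).append(ch)
    return same, differ


same, differ = split_by_latin1_compat(HUNGARIAN_ACCENTED)
print([hex(ord(c)) for c in differ])
# prints: ['0x151', '0x171']
```

Only o-double-acute and u-double-acute land in the second group, which
matches the two misidentified characters noted above.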

			regards, tom lane