[Snowball-discuss] Hungarian characters in hungarian/stop.txt

Olly Betts olly at survex.com
Wed Jun 11 07:28:20 BST 2014


On Wed, Jun 11, 2014 at 07:12:21AM +0100, Martin Porter wrote:
> Failure of the older contributions to conform to Unicode is a
> nuisance, but not an error.

This case is clearly an error - the stemmer is marked as being
ISO-8859-1, but the character values it uses are actually for ISO-8859-2
(all but two happen to be the same in ISO-8859-1 and ISO-8859-2).

Look at the comments here, which describe each character:

stringdef oq  hex 'F5'  //o-double acute
stringdef u'  hex 'FA'  //u-acute
stringdef u"  hex 'FC'  //u-umlaut
stringdef uq  hex 'FB'  //u-double acute

0xF5 and 0xFB are not those characters in ISO-8859-1, but they are
in ISO-8859-2.
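
As a quick check, here is a minimal Python sketch that decodes the four
byte values from the stringdef lines above under both encodings.  The
names STRINGDEFS and decode_both are just illustrative, and it relies on
Python's built-in "iso8859_1" and "iso8859_2" codecs:

```python
# Byte values copied from the stringdef lines quoted above.
STRINGDEFS = {
    "oq": 0xF5,  # commented as o-double acute
    "u'": 0xFA,  # commented as u-acute
    'u"': 0xFC,  # commented as u-umlaut
    "uq": 0xFB,  # commented as u-double acute
}


def decode_both(byte_value):
    """Return the character for one byte under ISO-8859-1 and ISO-8859-2."""
    raw = bytes([byte_value])
    return raw.decode("iso8859_1"), raw.decode("iso8859_2")


for name, value in STRINGDEFS.items():
    latin1, latin2 = decode_both(value)
    status = "same" if latin1 == latin2 else "DIFFERENT"
    print(f"{name:3} 0x{value:02X}  ISO-8859-1: {latin1}  "
          f"ISO-8859-2: {latin2}  {status}")
```

Only the 0xF5 and 0xFB rows should come out as different, and only the
ISO-8859-2 column should match the comments in the sbl source.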

You can also see that ISO-8859-2 is listed as covering Hungarian, but
ISO-8859-1 is not:

http://en.wikipedia.org/wiki/ISO/IEC_8859-2
http://en.wikipedia.org/wiki/ISO/IEC_8859-1

The "Unicode" version is currently built from the same file, but with
different snowball compiler options.  That would be correct if the
source really were written for ISO-8859-1, but since it isn't, the
Unicode version ends up using the wrong characters too.  Instead we
should have a separate Unicode version of the sbl source, like we have
for Romanian (where the stemmer is also ISO-8859-2).
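
To illustrate why the shared source goes wrong, here is a rough Python
sketch.  It assumes the Unicode build treats each hex value as a Unicode
code point (which only matches the byte value for ISO-8859-1), and it
works out the code point a separate Unicode source would need instead.
The helper name unicode_code_point is made up for this example:

```python
def unicode_code_point(byte_value, source_encoding="iso8859_2"):
    """Return the Unicode code point that a single byte stands for
    in the given 8-bit encoding."""
    return ord(bytes([byte_value]).decode(source_encoding))


# The two stringdefs whose byte values differ between the encodings.
for name, value in (("oq", 0xF5), ("uq", 0xFB)):
    naive = value  # assumption: hex value reused directly as a code point
    intended = unicode_code_point(value)
    print(f"{name}: byte 0x{value:02X} -> naive U+{naive:04X} "
          f"({chr(naive)}), intended U+{intended:04X} ({chr(intended)})")
```

The intended values are U+0151 and U+0171, which lie outside the 8-bit
range, so they cannot be reached by reinterpreting the existing hex
values.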

Cheers,
    Olly


