[Snowball-discuss] Hungarian characters in hungarian/stop.txt
Olly Betts
olly at survex.com
Wed Jun 11 07:28:20 BST 2014
On Wed, Jun 11, 2014 at 07:12:21AM +0100, Martin Porter wrote:
> Failure of the older contributions to conform to Unicode is a
> nuisance, but not an error.
This case is clearly an error - the stemmer is marked as being
ISO-8859-1, but the character values it uses are actually for ISO-8859-2
(all but two happen to be the same in ISO-8859-1 and ISO-8859-2).
Look at the comments here, which describe each character:
stringdef oq hex 'F5' //o-double acute
stringdef u' hex 'FA' //u-acute
stringdef u" hex 'FC' //u-umlaut
stringdef uq hex 'FB' //u-double acute
0xF5 and 0xFB are not those characters in ISO-8859-1, but they are
in ISO-8859-2.
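As a quick way to check this, here is a minimal Python sketch that decodes the four byte values from the stringdef lines above under both encodings. The names STRINGDEFS, decode_byte and mismatches are made up for this illustration and are not part of Snowball; the sketch assumes only Python's standard codecs.

```python
import unicodedata

# Byte values copied from the stringdef lines quoted above.
# The dictionary itself is a hypothetical helper, not Snowball code.
STRINGDEFS = {
    "oq": 0xF5,  # commented as o-double acute
    "u'": 0xFA,  # commented as u-acute
    'u"': 0xFC,  # commented as u-umlaut
    "uq": 0xFB,  # commented as u-double acute
}


def decode_byte(value, encoding):
    """Decode a single byte value using the named 8-bit encoding."""
    return bytes([value]).decode(encoding)


def mismatches(defs):
    """Return the names whose byte decodes differently in the two encodings."""
    return sorted(
        name
        for name, value in defs.items()
        if decode_byte(value, "iso-8859-1") != decode_byte(value, "iso-8859-2")
    )


if __name__ == "__main__":
    for name, value in STRINGDEFS.items():
        latin1 = decode_byte(value, "iso-8859-1")
        latin2 = decode_byte(value, "iso-8859-2")
        print(
            f"{name} 0x{value:02X}: "
            f"ISO-8859-1 = {unicodedata.name(latin1)}; "
            f"ISO-8859-2 = {unicodedata.name(latin2)}"
        )
    print("differ:", mismatches(STRINGDEFS))
```

Only oq and uq should be reported as differing: under ISO-8859-2 they decode to the double-acute letters the comments describe, while under ISO-8859-1 they decode to other letters.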
You can also see that ISO-8859-2 is listed as covering Hungarian, but
ISO-8859-1 is not:
http://en.wikipedia.org/wiki/ISO/IEC_8859-2
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
The "Unicode" version is currently built from the same file, but with
different snowball compiler options - that would be correct if the source
really were written for ISO-8859-1, but since it isn't, the Unicode
version ends up using the wrong characters too. Instead we should have a
separate Unicode version of the sbl source, like we have for Romanian
(where the stemmer is also ISO-8859-2).
Cheers,
Olly