[Snowball-discuss] Hungarian characters in hungarian/stop.txt
Tom Lane
tgl at sss.pgh.pa.us
Tue Jun 10 21:08:39 BST 2014
The Postgres project received a complaint about misspellings in our copy
of the Snowball Hungarian stop-word list:
http://www.postgresql.org/message-id/20140610081936.2599.96998@wrigleys.postgresql.org
basically to the effect that in place of ő (U+0151) we have õ (U+00F5).
I wonder first if someone could confirm or deny that this substitution
is correct?
I believe that the way we got this file in the first place was to
scrape it from
http://snowball.tartarus.org/algorithms/hungarian/stop.txt
since it's not in the Snowball distribution. It looks to me like the
webserver delivers that page in LATIN1 (ISO-8859-1) encoding, which would
go far towards explaining the encoding problem, since U+0151 isn't
representable in LATIN1. So now I'm wondering what other similar mistakes
there may be in the non-LATIN1 languages. Is there a method for obtaining
this and the other stopword files in UTF8?
It'd be even better if there were an up-to-date release tarball including
these files as well as the stemmers :-)
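As a rough illustration of how such a substitution could arise, here is
a minimal Python sketch. It assumes, without confirmation, that the
stop-word file was originally stored in ISO-8859-2 (LATIN2), where ő is
the single byte 0xF5, and that the bytes were then read as LATIN1. The
function names are hypothetical and are not part of Snowball or Postgres.

```python
# Sketch only: assumes the source bytes were ISO-8859-2 (LATIN2) and
# were misinterpreted as ISO-8859-1 (LATIN1). That is a guess about the
# cause, not something established in this thread.

def misread_latin2_as_latin1(text: str) -> str:
    """Encode text as LATIN2, then decode the same bytes as LATIN1."""
    return text.encode("iso-8859-2").decode("iso-8859-1")


def reread_latin1_as_latin2(text: str) -> str:
    """Reverse the misreading: recover the LATIN2 interpretation."""
    return text.encode("iso-8859-1").decode("iso-8859-2")


if __name__ == "__main__":
    # U+0151 and U+0171 are the two Hungarian lowercase letters with a
    # double acute accent; neither exists in LATIN1.
    for ch in "\u0151\u0171":
        wrong = misread_latin2_as_latin1(ch)
        print(f"U+{ord(ch):04X} {ch} -> U+{ord(wrong):04X} {wrong}")
```

If that assumption holds, reread_latin1_as_latin2() applied to the
scraped text would map õ back to ő, and the same reasoning would suggest
checking for û (U+00FB) standing in for ű (U+0171).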
regards, tom lane