[Snowball-discuss] Hungarian characters in hungarian/stop.txt

Tom Lane tgl at sss.pgh.pa.us
Tue Jun 10 21:08:39 BST 2014


The Postgres project received a complaint about misspellings in our copy
of the Snowball Hungarian stop-word list:
http://www.postgresql.org/message-id/20140610081936.2599.96998@wrigleys.postgresql.org
basically to the effect that in place of ő (U+0151) we have õ (U+00F5).

I wonder first if someone could confirm or deny that this substitution
is correct?

I believe that the way we got this file in the first place was to
scrape it from
http://snowball.tartarus.org/algorithms/hungarian/stop.txt
since it's not in the Snowball distribution.  It looks to me like the
webserver delivers that page in LATIN1 (ISO-8859-1) encoding, which would
go far towards explaining the encoding problem, since U+0151 isn't
representable in LATIN1.  So now I'm wondering what other similar mistakes
there may be in the non-LATIN1 languages.  Is there a method for obtaining
this and the other stopword files in UTF8?
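
For illustration, here is a minimal Python sketch of how such a
substitution can arise, and how one might undo it. It assumes the file
was originally written in ISO-8859-2 (LATIN2), which I have not
confirmed; the helper name redecode_latin2 is made up for this example.
In LATIN2 the byte 0xF5 is ő, while in LATIN1 the same byte is õ.

```python
def redecode_latin2(text: str) -> str:
    """Re-interpret text that was wrongly decoded as ISO-8859-1.

    Assumption: the underlying bytes were really ISO-8859-2.
    Raises UnicodeEncodeError if text contains characters outside
    LATIN1, i.e. if it was not produced by a LATIN1 decode.
    """
    return text.encode("iso-8859-1").decode("iso-8859-2")


# The same byte means different letters in the two encodings.
raw = "\u0151".encode("iso-8859-2")    # b'\xf5' (ő in LATIN2)
garbled = raw.decode("iso-8859-1")     # 'õ' (U+00F5 in LATIN1)

# Round-tripping through the right encoding recovers the original.
fixed = redecode_latin2(garbled)       # 'ő' (U+0151)

# Letters shared by both encodings, such as á and é, pass through
# unchanged, so a whole stop-word line can be converted at once.
line = redecode_latin2("\u00f5ket")    # 'őket'
```

If that assumption holds, ű (U+0171) would be affected the same way,
showing up as û (U+00FB), so it would be worth checking for that too.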

It'd be even better if there were an up-to-date release tarball including
these files as well as the stemmers :-)

			regards, tom lane


