[Snowball-discuss] Hungarian characters in hungarian/stop.txt

Olly Betts olly at survex.com
Wed Jun 11 11:37:54 BST 2014


On Wed, Jun 11, 2014 at 07:46:26AM +0100, Martin Porter wrote:
> And the absence of Unicode source for the Hungarian stemmer is not
> itself a bug. (We have a Unicode source for Romanian because I wrote
> the Romanian stemmer.)

We don't have a "stem_Unicode.sbl" file for any of the stemmers which
use ISO-8859-1 (or ASCII), but that's because the character constant
values are the same for ISO-8859-1 and ASCII as they are for Unicode.
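
As a quick check of that claim, the following Python sketch (not part of
the Snowball sources) decodes every possible byte as ISO-8859-1 and
confirms that the resulting Unicode code point equals the byte value:

```python
# Decode each single byte as ISO-8859-1 and compare the resulting Unicode
# code point with the original byte value.
mismatches = [
    b for b in range(256)
    if ord(bytes([b]).decode("iso-8859-1")) != b
]

# An empty list means every ISO-8859-1 value is also the Unicode value.
print(mismatches)
```

ASCII is the 0-127 subset of the same range, so the same holds for it.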

If <language>/stem_Unicode.sbl doesn't exist, the snowball build system
copies <language>/stem_ISO_8859_1.sbl to <language>/stem_Unicode.sbl to
create it:

algorithms/%/stem_Unicode.sbl: algorithms/%/stem_ISO_8859_1.sbl
        cp $^ $@

So really, we do have a "Unicode source" for all the stemmers - if
there's a "stem_ISO_8859_1.sbl" file, that's also the Unicode source.
All the languages which don't have "stem_ISO_8859_1.sbl" do have
"stem_Unicode.sbl" (and in many cases also a version with a different
encoding).
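
For illustration only, the fallback described above can be sketched in
Python. This is not how the build system is implemented (that is the
make rule quoted earlier); the function name and the "somelang"
directory are hypothetical, and unlike make the sketch ignores file
timestamps:

```python
import shutil
import tempfile
from pathlib import Path


def ensure_unicode_source(lang_dir: Path) -> Path:
    """Return lang_dir/stem_Unicode.sbl, copying the ISO-8859-1 source if needed."""
    unicode_src = lang_dir / "stem_Unicode.sbl"
    latin1_src = lang_dir / "stem_ISO_8859_1.sbl"
    if not unicode_src.exists():
        # Same effect as `cp $^ $@`; raises FileNotFoundError when there is
        # no ISO-8859-1 source to copy from.
        shutil.copyfile(latin1_src, unicode_src)
    return unicode_src


# Demonstration in a throwaway directory with a placeholder source file.
with tempfile.TemporaryDirectory() as tmp:
    lang_dir = Path(tmp) / "algorithms" / "somelang"  # hypothetical language
    lang_dir.mkdir(parents=True)
    (lang_dir / "stem_ISO_8859_1.sbl").write_text("// placeholder\n", encoding="ascii")
    result = ensure_unicode_source(lang_dir)
    copied_text = result.read_text(encoding="ascii")
    print(result.name, repr(copied_text))
```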

My fix renamed the Hungarian stem_ISO_8859_1.sbl to stem_ISO_8859_2.sbl.
The copy rule above only applies when stem_ISO_8859_1.sbl exists, which
is why Hungarian then needs to have an explicit stem_Unicode.sbl.
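
To show why an ISO-8859-2 source can't simply be copied, here is a small
Python sketch using the Hungarian letters o and u with double acute. The
byte values 0xF5 and 0xFB are their ISO-8859-2 encodings; whether the
stemmer source refers to exactly these values is an assumption here:

```python
# ISO-8859-2 byte values for two letters of the Hungarian alphabet.
latin2_bytes = [0xF5, 0xFB]

rows = []
for byte_value in latin2_bytes:
    raw = bytes([byte_value])
    as_latin2 = raw.decode("iso8859_2")  # the intended Hungarian letter
    as_latin1 = raw.decode("iso8859_1")  # what a plain copy would imply
    rows.append((byte_value, ord(as_latin2), ord(as_latin1)))

for byte_value, latin2_cp, latin1_cp in rows:
    # The ISO-8859-2 code point differs from the byte value, while the
    # ISO-8859-1 code point is identical to it.
    print(f"byte 0x{byte_value:02X}: ISO-8859-2 -> U+{latin2_cp:04X}, "
          f"ISO-8859-1 -> U+{latin1_cp:04X}")
```

So the character constants in a Unicode version of such a source have to
be different from those in the ISO-8859-2 version.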

Cheers,
    Olly


