[Snowball-discuss] Hungarian characters in hungarian/stop.txt
Olly Betts
olly at survex.com
Wed Jun 11 11:37:54 BST 2014
On Wed, Jun 11, 2014 at 07:46:26AM +0100, Martin Porter wrote:
> And the absence of Unicode source for the Hungarian stemmer is not
> itself a bug. (We have a Unicode source for Romanian because I wrote
> the Romanian stemmer.)
We don't have a "stem_Unicode.sbl" file for any of the stemmers which
use ISO-8859-1 (or ASCII), but that's because every ISO-8859-1 (and
ASCII) character constant has the same value as its Unicode code point.
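To illustrate the point about character values, here is a small Python check. It is only a sketch and not part of Snowball; the function name is made up for the example. It decodes each of the 256 possible byte values as ISO-8859-1 and confirms the resulting Unicode code point equals the byte value.

```python
def latin1_matches_unicode() -> bool:
    """Return True if every ISO-8859-1 byte decodes to the code point
    with the same numeric value (hypothetical helper, for illustration)."""
    for byte_value in range(256):
        # Decode a single byte as ISO-8859-1 and look at its code point.
        char = bytes([byte_value]).decode("iso-8859-1")
        if ord(char) != byte_value:
            return False
    return True


if __name__ == "__main__":
    print(latin1_matches_unicode())
```

ASCII is the 0-127 subset of the same range, so the same holds for the ASCII-only stemmers.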
If <language>/stem_Unicode.sbl doesn't exist, the snowball build system
copies <language>/stem_ISO_8859_1.sbl to <language>/stem_Unicode.sbl to
create it:
algorithms/%/stem_Unicode.sbl: algorithms/%/stem_ISO_8859_1.sbl
	cp $^ $@
So really, we do have a "Unicode source" for all the stemmers - if
there's a "stem_ISO_8859_1.sbl" file, that's also the Unicode source.
All the languages which don't have "stem_ISO_8859_1.sbl" do have
"stem_Unicode.sbl" (and in many cases also a version with a different
encoding).
My fix renamed the Hungarian stem_ISO_8859_1.sbl to stem_ISO_8859_2.sbl,
so the copy rule above no longer applies to it, which is why Hungarian
then needs to have an explicit stem_Unicode.sbl.
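The reason a plain copy wouldn't work for ISO-8859-2 can be seen with another small Python sketch (again not Snowball code; the helper name is invented for the example). It compares the byte a character encodes to with its Unicode code point. For the Hungarian double-acute letters the two differ, while a letter shared with ISO-8859-1 such as e-acute keeps the same value.

```python
def byte_and_codepoint(char: str, encoding: str) -> tuple[int, int]:
    """Return (encoded byte value, Unicode code point) for a single
    character in a single-byte encoding (hypothetical helper)."""
    encoded = char.encode(encoding)
    return encoded[0], ord(char)


if __name__ == "__main__":
    # U+0151 is o with double acute, U+0171 is u with double acute,
    # U+00E9 is e with acute.
    for char in ("\u0151", "\u0171", "\u00e9"):
        byte_value, code_point = byte_and_codepoint(char, "iso-8859-2")
        print(f"U+{code_point:04X} -> byte 0x{byte_value:02X}")
```

So character constants written for ISO-8859-2 would be wrong for those letters if the file were simply reused as the Unicode source.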
Cheers,
Olly