[Snowball-discuss] Greek stemmer
Martin Porter
martin.f.porter at gmail.com
Thu Sep 1 15:07:29 BST 2016
Hello Oleg,
I was interested to see your work.
It is actually possible to have snowball strings made up of UTF-8
encoded Unicode characters, so long as the script is compiled to be
applied to text in the same form. Historically, the use of pure ASCII,
and stringdefs, was for the benefit of the different character
encoding schemes that used to abound, as well as making editing
easier on a standard keyboard. I thought I'd mention this, since the
snowball scripts can be made to look a lot cleaner, even if less
general, if the stringdef encodings are bypassed.
I was surprised by the large number of separate steps in your stemmer.
Could not many be amalgamated into single steps? That would of course
speed things up. I also wondered about the absence of 'region'
markers, within which stems don't get removed -- so common in the
other stemmers.
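
As a rough illustration of the region idea, here is a minimal sketch in
Python (not Snowball) of R1 as the other Snowball stemmers define it: the
region after the first non-vowel that follows a vowel, or the empty region
at the end of the word if there is no such non-vowel. A suffix is removed
only if it lies wholly inside the region, which protects short stems. The
vowel set and the function names are assumptions made for illustration;
they are not taken from the Greek stemmer under discussion.

```python
# Assumption: a simplified set of unaccented lowercase Greek vowels.
GREEK_VOWELS = set("αεηιουω")


def r1_start(word, vowels=GREEK_VOWELS):
    """Return the index at which R1 begins (len(word) if R1 is empty)."""
    for i in range(1, len(word)):
        # First non-vowel that immediately follows a vowel.
        if word[i] not in vowels and word[i - 1] in vowels:
            return i + 1
    return len(word)


def strip_suffix_in_r1(word, suffix, vowels=GREEK_VOWELS):
    """Remove suffix only when it lies entirely within R1."""
    suffix_start = len(word) - len(suffix)
    if word.endswith(suffix) and suffix_start >= r1_start(word, vowels):
        return word[:suffix_start]
    return word


if __name__ == "__main__":
    print(strip_suffix_in_r1("καλος", "ος"))  # suffix is inside R1
    print(strip_suffix_in_r1("ος", "ος"))     # word too short: unchanged
```

Without such a check, a rule that removes "ος" would reduce a word that
consists of nothing but the suffix to the empty string.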