[Snowball-discuss] a simple algorithm problem

Olly Betts olly at survex.com
Thu Jan 6 16:12:32 GMT 2005


On Thu, Jan 06, 2005 at 09:12:34AM +0000, Martin Porter wrote:
> Working with 64K characters, a bitmap might go up to 8K in size, which is
> not an intolerable overhead. In practice they are much smaller, since the
> codes we need in the stemmers do not have high Unicode values.

I don't really see this as a big problem.  You can always just dump out
a C switch statement and let the C compiler's code generator decide on
the best way to represent it.

> So one idea is to declare 'utf8' in the Snowball script, allowing character
> defs in the range 0-64K, as in the 2-byte character version. Characters
> could be written with their Unicode values, and encoded in utf-8 form in
> strings.

I doubt most of the stemmers care about characters beyond a 2 byte utf-8
representation.  So the answer is likely to be the same for any sequence
with the same first 2 bytes, and you don't even need to decode the character
fully.  Look at the first byte to see if this is a one byte, two byte,
or longer sequence, and have tables for the first two and a fixed answer
for the third.

> Looking again at the cursor movement issues:
> 
> If I've got my thinking right: in goto and gopast, cursor movement can be
> done one byte at a time, (cursor++; or cursor--;). If expression C is made
> up of well-formed utf-8 strings. 'goto C' must either fail, or end on a
> valid character boundary.

That's true.

> 'next' requires its own implementation, and as you say, backward movement is
> not a problem.
> 
> 'next' is implicit in character tests (vowel, non-vowel), and hop N (=
> 'next' done N times).
> 
> I'm not sure 'size' is used in the Snowball scripts: it could be defined to
> give an approximate answer in the utf-8 case, or implemented exactly.

Hmm, I don't like the idea that the encoding would affect the stemming
algorithm.

Thinking about how size might be used, it'll be in comparisons.  The
number of bytes is an upper bound on the number of characters.  You can
also get a lower bound by dividing it by 6 (IIRC).  The code generator
can probably use these two bounds to avoid actually walking the string
completely in many cases.  E.g. if you want to test if the stem is
least 10 characters, and it's 8 bytes long, you don't need to walk
the string.

> Incidentally, do you have a view on the use of free-floating accents?
> (Unicode 0300-036F)

Ideally you want to normalise text before stemming.  There's information
on unicode normalisation here:

http://www.unicode.org/unicode/reports/tr15/

I've not explored this in great depth, but for latinate languages, I
suspect pragmatically you can just strip out combining accents for
stemming purposes, and the results will be pretty much what you'd get
for text that was written without access to accented forms - umlauts in
German are the main exception I'm aware of ("ä" (a-umlaut) -> "ae",
etc).  But I believe that in Swedish, you'd just write "a" in this
situation.

Cheers,
    Olly



More information about the Snowball-discuss mailing list