[Snowball-discuss] a simple algorithm problem
Olly Betts
olly at survex.com
Wed Jan 5 00:55:07 GMT 2005
On Tue, Jan 04, 2005 at 10:35:13AM +0000, Martin Porter wrote:
> In retrospect, I occasionally wish groupings were not in
> the language. Instead of
>
> A) define vowel 'aeiou'
>
> one could have
>
> B) define vowel as among('a' 'e' 'i' 'o' 'u')
>
> (A) is implemented as a bitmap, and (B) as a fast table lookup, and (A) is
> faster than (B), but optimisation in the codegenerator could turn (B) into a
> bitmap as well.
>
> There are other differences however: (B) needs to be defined
> in a 'forward' or 'backward' context; non-vowel is a neat test that works
> with style (A) but not (B).
Could (A) simply be handled internally as a shorthand for defining form (B) in
both forward and backward context, without changing the meaning of existing
code? As you say, the code generator can optimise this to give the same
generated code as at present (although it may not be worth the complications of
using a bitmap for multi-byte utf-8 characters).
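
To make the bitmap idea concrete, here is a minimal sketch in C of a grouping test for 'aeiou'. It is not taken from the Snowball code generator: the name in_grouping, its argument order and the bitmap layout (one bit per character code from min to max) are assumptions made for illustration, and it only covers single-byte characters.

```c
/* Hypothetical bitmap for the grouping 'aeiou', assuming ASCII codes.
 * Bit n stands for character code ('a' + n), so the set bits are at
 * offsets 0 (a), 4 (e), 8 (i), 14 (o) and 20 (u). */
static const unsigned char vowel_bitmap[] = { 0x11, 0x41, 0x10 };

/* Return 1 if ch is in the grouping described by bitmap, else 0.
 * min and max are the lowest and highest character codes covered. */
static int in_grouping(const unsigned char *bitmap, int min, int max, int ch)
{
    if (ch < min || ch > max) return 0;   /* outside the bitmap's range */
    ch -= min;
    return (bitmap[ch >> 3] >> (ch & 7)) & 1;
}
```

With this layout a direction-independent test such as in_grouping(vowel_bitmap, 'a', 'u', c) serves both forward and backward contexts, since only the choice of which character to fetch differs between the two.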
The manual defines "non-vowel" as the same as "(not vowel next)". Wouldn't
that work for the (B) version too? In which case just extending where "non"
can be used solves that issue.
> If groupings were NOT in the language, you could reduce the difference
> between utf-8 and single character working to the definition of a couple of
> macros PREV and NEXT (thinking of ANSI C codegeneration) that move the
> character cursor left or right by one place, and that only turn up in the
> definitions of (a) to (e) above.
It would be great if snowball could process utf-8 directly. Although
characters are variable width, you can at least write a simple and efficient
"PREV" macro for utf-8 (because the first byte of a character is always in
a particular range which isn't used for subsequent bytes).
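
As a rough sketch of what that could look like in C, the helpers below step a byte offset left or right by one utf-8 character. They are written as functions rather than macros for readability, the names utf8_prev and utf8_next are made up for this example, and they assume the buffer holds valid utf-8.

```c
#include <assert.h>
#include <stddef.h>

/* A utf-8 continuation byte always has the bit pattern 10xxxxxx;
 * the first byte of a character never does. */
#define IS_CONTINUATION(b) (((unsigned char)(b) & 0xC0) == 0x80)

/* Hypothetical PREV: return the offset of the first byte of the
 * character that ends just before pos. Requires pos > 0. */
static size_t utf8_prev(const char *s, size_t pos)
{
    assert(pos > 0);
    do {
        pos--;
    } while (pos > 0 && IS_CONTINUATION(s[pos]));
    return pos;
}

/* Hypothetical NEXT: return the offset just past the character that
 * starts at pos. Requires pos < len, where len is the buffer length. */
static size_t utf8_next(const char *s, size_t pos, size_t len)
{
    assert(pos < len);
    do {
        pos++;
    } while (pos < len && IS_CONTINUATION(s[pos]));
    return pos;
}
```

For the four-byte string "a\xC3\xA9z" (a, e-acute, z), stepping forward from offset 1 skips both bytes of the accented character and lands on offset 3, and stepping backward from offset 3 returns to offset 1.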
We want utf-8 stemming for Xapian, so I'm going to have to address this
somehow...
[
Incidentally, I think there's an error in the manual where it talks about
among. Look at http://snowball.tartarus.org/p/snowman.html which says:
    The effect of obeying substring when the preceding among is not obeyed is
    undefined. This would happen for example here,

        try($x != 617 substring)
        among(...) // 'substring' is bypassed in the exceptional case where x == 617

I think substring and among are switched in the first sentence, and that it
should read: "The effect of obeying *among* when the preceding *substring* is
not obeyed is undefined."
]
Cheers,
Olly