[Snowball-discuss] Optimising among

Olly Betts olly at survex.com
Fri Sep 15 05:00:47 BST 2006


On Fri, Sep 15, 2006 at 02:05:54AM +0100, Olly Betts wrote:
> It might also be worth special casing when we have 2 or fewer candidates
> since checking "z->p[z->c] == 'a' || z->p[z->v] == 'b'" is likely to be
> faster than the bit shifting and and-ing.  And this would also allow us to
> add a check for the among with "'s'", "'s" or "'" in the English stemmer
> which current fails to fit in a 32 character block.
> 
> The other case I don't deal with in the English stemmer is one which
> includes an empty string - this just needs a slightly different variant
> of the bitmap check (which I'm just trying to implement now).

Implementing all the above takes the reduction in running time for
the English stemmer to 46%!

That probably means that the snowball version is now comparable to the
hand-coded C (Martin said the difference was about a factor of 2).

I suspect not all languages will benefit as much - for English I'm now
adding shortcut checks for every among, but I'm not able to for every
language.

http://www.oligarchy.co.uk/xapian/patches/snowball-among-speedup-2.patch

I'd be interested to hear how this benchmarks when measuring real time
rather than running under cachegrind.

> Looking at what doesn't get encoded for other languages, it's pretty
> common to have all lower case letters and one other character (typically
> an upper case letter, accented character, or an apostrophe).  Perhaps
> that's worth checking for too.

There are 58 amongs which are still not handled of which 31 could be
with a bitmap plus a single additional value check (5 of these can
also be done with 3 compares, which is probably more efficient).

Incidentally, I checked and for the current algorithms not a single
extra case could be handled by allowing the bitmap to cover a
non-aligned 32 value range instead of an aligned one.

Cheers,
    Olly



More information about the Snowball-discuss mailing list