[Snowball-discuss] Optimising among

Olly Betts olly at survex.com
Thu Sep 14 18:57:25 BST 2006


On Tue, Sep 12, 2006 at 09:44:31AM +0200, Martin Porter wrote:
> >so that's a good candidate for optimising (if only I
> >understood it!)
> 
> -- and it has been optimised accordingly.

I wasn't trying to suggest it hadn't been - just noting that a small
gain here would be more beneficial than a small gain elsewhere.

> Most among calls give signal f, that is, usually the end of a word is not
> found in the among(...). This simple fact is not utilised in any way in the
> implementation of among(...), and I've often thought that it might be.
> 
> Of course the among(...) implementation is a general procedure which misses
> obvious optimisations for certain classes of string. A test of the form
> 
>     among('a','e','i','o','u')
> 
> could be done by bit-map look up etc.

I've just tried out a variant of this idea which makes use of your
observation that among usually fails, coupled with noticing that in
most cases the first character tested is the same for a number of
options in the among, so even for a large among, there are usually
a fair small number of first characters which will match.

So for each among I look at the first character to be checked in each
string.  If they are all the same mod 32, I add a simple bitmap check
which in many cases means I can give signal f before even calling
find_among or find_among_b.

Because the strings are typically composed of lower case unaccented
latin letters this gets most of them.  This reduces the time spent in
find_among_b by 45% and reduces the overall runtime for running
stemwords on the English voc.txt by 15% (the "times" are all cycle
estimates from cachegrind since timing doesn't give stable values on
this machine).  find_among_b still dominates the profile, but down from
~33% to ~22% of the time.

I have some other code generation patches applied in both the before
and after cases, but I doubt they make much difference to this.

And I've checked that the stemmers give the same results for english,
danish, finnish and hungarian before and after this change.

Here's the patch:

http://oligarchy.co.uk/xapian/patches/snowball-among-speedup.patch

Cheers,
    Olly



More information about the Snowball-discuss mailing list