[Snowball-discuss] Optimising among

Fri Sep 15 00:49:44 BST 2006

On Thu, Sep 14, 2006 at 06:57:25PM +0100, Olly Betts wrote:
> So for each among I look at the first character to be checked in each
> string.  If they are all the same mod 32, I add a simple bitmap check
> which in many cases means I can give signal f before even calling
> find_among or find_among_b.

I assume you mean "all the same div 32" (not mod) here, reading the code.

This makes a lot of sense to me.  I'll give Martin a couple of days to
comment, since I've not looked at this bit of the code in detail, but if he
has no objections, I'd be happy to commit this.

I wonder if the characters in the among structures used are as nicely
chunked into groups of 32 characters in other languages (particularly
Russian).  Taking a quick look at the algorithms makes me think they are,
but if not, the same kind of idea might be applied with a larger bitmap, or
some such.  Anyway, they probably are.

> Because the strings are typically composed of lower case unaccented
> latin letters this gets most of them.  This reduces the time spent in
> find_among_b by 45% and reduces the overall runtime for running
> stemwords on the English voc.txt by 15% (the "times" are all cycle
> estimates from cachegrind since timing doesn't give stable values on
> this machine).  find_among_b still dominates the profile, but down from
> ~33% to ~22% of the time.

That's quite a nice improvement, well done!

-- 
Richard