[Snowball-discuss] Optimising among
richard at lemurconsulting.com
richard at lemurconsulting.com
Fri Sep 15 00:49:44 BST 2006
On Thu, Sep 14, 2006 at 06:57:25PM +0100, Olly Betts wrote:
> So for each among I look at the first character to be checked in each
> string. If they are all the same mod 32, I add a simple bitmap check
> which in many cases means I can give signal f before even calling
> find_among or find_among_b.
I assume you mean "all the same div 32" (not mod) here, reading the code.
This makes a lot of sense to me. I'll give Martin a couple of days to
comment, since I've not looked at this bit of the code in detail, but if he
has no objections, I'd be happy to commit this.
I wonder if the characters in the among structures used are as nicely
chunked into groups of 32 characters in other languages (particularly
Russian). Taking a quick look at the algorithms makes me think they are,
but if not, the same kind of idea might be applied with a larger bitmap, or
some such. Anyway, they probably are.
> Because the strings are typically composed of lower case unaccented
> latin letters this gets most of them. This reduces the time spent in
> find_among_b by 45% and reduces the overall runtime for running
> stemwords on the English voc.txt by 15% (the "times" are all cycle
> estimates from cachegrind since timing doesn't give stable values on
> this machine). find_among_b still dominates the profile, but down from
> ~33% to ~22% of the time.
That's quite a nice improvement, well done!
--
Richard
More information about the Snowball-discuss
mailing list