[Snowball-discuss] Snowball is Thread Safe?

Richard Boulton richard at tartarus.org
Thu Feb 17 14:49:20 GMT 2011


On 17 February 2011 14:38, Michel Lemay <mlemay at coveo.com> wrote:
> I was playing with the stemmer and in order to get some performance, I
> parallelized it with TBB.
>
> It crashed instantly..  Having one instance of the stemmer per thread fix
> the problem..

Yes, that will be necessary: as you observe, the stemmer objects hold
state which is used while calculating each stem.  In fact, there isn't
really any reason for a stemmer object to exist other than to hold
state!

> This finding is rather surprising!  Any reason why it has been implemented
> in this way: sharing internal buffer instead of using stack storage ?

I suppose the main answer is "for convenience of writing snowball".
Parallelism also wasn't much of a concern when snowball was initially
written.

Were you hoping to split the work performed by each call to a stemmer
amongst many processors?  That seems unlikely to be efficient: each
call to a stemmer is pretty fast, and works on a small amount of data.
Trying to share the workload of a single call among many processors
seems likely to run into cache contention issues, which might even
make a parallel implementation slower.

Instead, it seems to me that you'd be better off having each call to a
stemmer go to a single processor, and sharing those calls among many
stemmers; which, of course, you can do with the current implementation
of snowball, as long as you instantiate one stemmer per thread.
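
As a rough illustration of the one-stemmer-per-thread arrangement, here
is a minimal C++ sketch.  ToyStemmer and stem_all are hypothetical names
made up for this example: ToyStemmer is only a stand-in that holds a
working buffer the way a real stemmer object holds state, and its
"strip a trailing s" rule is not a real Snowball algorithm.  It uses
plain std::thread rather than TBB to stay self-contained.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a Snowball stemmer: it keeps working state
// (a buffer) that is overwritten on every call, so a single instance
// must not be used by two threads at once.
class ToyStemmer {
public:
    // Toy rule, NOT a real Snowball algorithm: strip one trailing 's'.
    const std::string& stem(const std::string& word) {
        buffer_ = word;  // per-instance state, mutated on every call
        if (buffer_.size() > 1 && buffer_.back() == 's') buffer_.pop_back();
        return buffer_;
    }

private:
    std::string buffer_;
};

// Stems every word, giving each thread its own stemmer instance and its
// own contiguous slice of the input.  Each whole call to stem() runs on
// a single thread; only the list of calls is divided up.
std::vector<std::string> stem_all(const std::vector<std::string>& words,
                                  unsigned num_threads) {
    std::vector<std::string> out(words.size());
    if (num_threads == 0) num_threads = 1;
    const std::size_t chunk = (words.size() + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(words.size(), t * chunk);
        const std::size_t end = std::min(words.size(), begin + chunk);
        if (begin == end) break;  // no words left for this thread
        workers.emplace_back([&words, &out, begin, end] {
            ToyStemmer stemmer;  // one instance per thread, never shared
            for (std::size_t i = begin; i < end; ++i) {
                out[i] = stemmer.stem(words[i]);  // copy result out of the buffer
            }
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

Each thread writes only to its own slice of the output, so no locking
is needed; the stemmer's internal buffer is never visible to any other
thread.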

-- 
Richard


