[Snowball-discuss] New, and a couple of questions

Wed Mar 10 14:31:02 2004

On Wed, Mar 10, 2004 at 01:48:24PM +0000, Richard Boulton wrote:
> As Martin says, you'll have to compare the strings returned.  However, 
> if you're worried about speed, don't use strcmp - you can write your own 
> comparison routine which is faster for this case.  In particular, you 
> have the two lengths, so the first step is clearly to compare them - if 
> they differ, the stemmed form is different from the original.
> 
> Also, if a stemming operation has occurred, it will typically change the 
>  end of the word rather than the beginning - so compare the strings 
> starting at the end.
> 
> The worst case is still the case where stemming hasn't occurred, in 
> which case you have to compare the whole string.

The problem here is that this worst case is likely to be several times
worse - a good hand-coded strcmp implementation is likely to load and
compare whole words if the strings are word aligned, and that could easily
be 3 times faster than the naive "load a byte from each and compare".

> However, you can 
> probably speed the case where stemming _has_ occurred to probably an 
> average of around 2 comparisons.

I wonder if a hybrid approach is best?

compare lengths, if not equal, stemming happened

compare the last character, if not equal, stemming happened

strcmp the whole string

(Or perhaps use memcmp with length - 1...)

For the sample vocabularies, this gives:

len     last    whole   same    language

16592   0       0       7237    danish
24439   56      376     20798   dutch
18453   1202    0       9745    english
42602   0       0       7398    finnish
16871   0       22      3510    french
23772   81      1641    9539    german
31371   4       1       4118    italian
14203   0       0       6425    norwegian
17394   1955    0       11079   porter
27729   2       0       4285    portuguese
45680   0       0       3993    russian
24543   45      848     2954    spanish
20769   0       0       9854    swedish

len is the number of differing cases which are caught by the length test
last is the number of differing cases caught by looking at the last character
whole is the number of differing cases which required a whole string compare
same is the number of cases which are actually the same
language is the language of the stemmer

Note that the sample vocabularies probably don't represent the frequencies of
these cases in real text, but they're easy to use as you have a file of words
and a corresponding file of their stems.

So if you're worried about performance, profile the different approaches on
representative samples of the data you'll be processing!

Cheers,
    Olly