[Snowball-discuss] New, and a couple of questions
Olly Betts
olly@survex.com
Wed Mar 10 14:31:02 2004
On Wed, Mar 10, 2004 at 01:48:24PM +0000, Richard Boulton wrote:
> As Martin says, you'll have to compare the strings returned. However,
> if you're worried about speed, don't use strcmp - you can write your own
> comparison routine which is faster for this case. In particular, you
> have the two lengths, so the first step is clearly to compare them - if
> they differ, the stemmed form is different from the original.
>
> Also, if a stemming operation has occurred, it will typically change the
> end of the word rather than the beginning - so compare the strings
> starting at the end.
>
> The worst case is still the case where stemming hasn't occurred, in
> which case you have to compare the whole string.
The problem here is that this worst case is likely to be several times
worse - a good hand-coded strcmp implementation is likely to load and
compare whole words if the strings are word aligned, and that could easily
be 3 times faster than the naive "load a byte from each and compare".
> However, you can
> probably speed the case where stemming _has_ occurred to probably an
> average of around 2 comparisons.
I wonder if a hybrid approach is best?
compare lengths, if not equal, stemming happened
compare the last character, if not equal, stemming happened
strcmp the whole string
(Or perhaps use memcmp with length - 1...)
For the sample vocabularies, this gives:
len last whole same language
16592 0 0 7237 danish
24439 56 376 20798 dutch
18453 1202 0 9745 english
42602 0 0 7398 finnish
16871 0 22 3510 french
23772 81 1641 9539 german
31371 4 1 4118 italian
14203 0 0 6425 norwegian
17394 1955 0 11079 porter
27729 2 0 4285 portuguese
45680 0 0 3993 russian
24543 45 848 2954 spanish
20769 0 0 9854 swedish
len is the number of differing cases which are caught by the length test
last is the number of differing cases caught by looking at the last character
whole is the number of differing cases which required a whole string compare
same is the number of cases which are actually the same
language is the language of the stemmer
Note that the sample vocabularies probably don't represent the frequencies of
these cases in real text, but they're easy to use as you have a file of words
and a corresponding file of their stems.
So if you're worried about performance, profile the different approaches on
representative samples of the data you'll be processing!
Cheers,
Olly