[Snowball-discuss] New, and a couple of questions

Richard Boulton richard@tartarus.org
Wed Mar 10 13:50:01 2004


Martin Porter wrote:
> There is no fast way of discovering if a word has been stemmed. You could
> set a flag in the various functions of q/utilities.c that alter z->p, but
> this is not a general solution, since Snowball can use auxiliary strings
> that may be altered while the main string remains unaltered - although none
> of the current stemmers would do that. So you have to use strcmp or equivalent. 

As Martin says, you'll have to compare the strings returned.  However, 
if you're worried about speed, don't use strcmp - you can write your own 
comparison routine which is faster for this case.  In particular, you 
have the two lengths, so the first step is clearly to compare them - if 
they differ, the stemmed form is different from the original.

Also, if a stemming operation has occurred, it will typically change the 
end of the word rather than the beginning - so compare the strings 
starting at the end.
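
As a rough sketch of the two steps above (length check first, then a 
comparison running backwards from the end), something like the following 
would do.  The name stem_changed and its signature are made up for 
illustration and are not part of Snowball; it assumes you already hold 
both strings and their lengths.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the Snowball API: returns 1 if the
 * stemmed form differs from the original word, 0 if they are identical.
 * Both strings are passed with explicit lengths, so no terminating NUL
 * is required. */
static int stem_changed(const char *orig, size_t orig_len,
                        const char *stem, size_t stem_len)
{
    size_t i;

    /* Different lengths: the word was certainly altered. */
    if (orig_len != stem_len)
        return 1;

    /* Same length: compare from the last character backwards, since
     * stemming usually rewrites the suffix rather than the prefix. */
    for (i = orig_len; i > 0; i--) {
        if (orig[i - 1] != stem[i - 1])
            return 1;
    }
    return 0;
}
```

For example, stem_changed("running", 7, "run", 3) returns at the length 
check without looking at any characters, while stem_changed("sky", 3, 
"sky", 3) has to examine all three.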

The worst case is still the one where stemming hasn't occurred, since 
you then have to compare the whole string.  However, you can probably 
bring the case where stemming _has_ occurred down to an average of 
around 2 comparisons.

...
> But my machine is fairly slow: on Richard
> Boulton's machine it would take a tenth of that time.

Actually, the original test takes 17 seconds on my machine.  (Unless I 
turn optimising on, in which case the compiler notices that strcmp is a 
pure function (i.e., has no side effects) and that I'm ignoring the return 
value, so it doesn't bother to call it at all, and the test takes 0.1 
seconds.)
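
One way to keep such a timing test honest is to make the result of each 
comparison matter.  The sketch below is not the original test program; 
count_changed and its parameters are invented for illustration.  It 
accumulates the strcmp results into a count, and provided the caller 
then actually uses that count (printing it, say), the optimiser has no 
licence to drop the calls.

```c
#include <string.h>

/* Hypothetical benchmark loop: counts how many word/stem pairs differ.
 * Because the strcmp results feed into the returned count, the calls
 * are no longer dead code as long as the caller uses the count. */
static unsigned long count_changed(const char *const *words,
                                   const char *const *stems,
                                   unsigned long n)
{
    unsigned long i, changed = 0;

    for (i = 0; i < n; i++) {
        if (strcmp(words[i], stems[i]) != 0)
            changed++;
    }
    return changed;
}
```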