[Snowball-discuss] Update on regex approach

Martin Porter martin_porter@softhome.net
Wed, 08 May 2002 15:32:45 -0600


Allan,

Quite apart from Oleg's observation (which I feel carries a lot of weight),
I am not sure that optimising the 'measure' function is going to make much
difference.

Here is what I found when using ANSI C.

The earlier Porter stemmer computes the measure every time an ending is
found whose removal depends on the stem length. The later Porter stemmer I
wrote (in particular the Snowball-generated stemmer) finds the critical
positions p1 and p2 just once, at the start, and then tests whether the
ending falls after one or the other of these positions. Effectively, in the
later stemmer the measure function is called just once per word, and I
expected to see a speed improvement. In fact there was no discernible
difference. The reason is that in the earlier stemmer, for an average word,
the measure function is called about once, or very slightly more than once.
This is because finding an ending is a rare event.
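
To make the two strategies concrete, here is a minimal sketch in C. It is
not the code of either stemmer: the names is_consonant, measure and
find_marks are hypothetical, lower-case ASCII input is assumed, and 'y' is
treated as a consonant only at the start of a word or after a vowel. The
sketch assumes p1 is the position just after the first vowel-consonant pair
and p2 the position just after the second.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: is word[i] a consonant in Porter's sense?
   'y' is a consonant only at the start of the word or after a vowel. */
int is_consonant(const char *word, size_t i)
{
    switch (word[i]) {
    case 'a': case 'e': case 'i': case 'o': case 'u':
        return 0;
    case 'y':
        return i == 0 ? 1 : !is_consonant(word, i - 1);
    default:
        return 1;
    }
}

/* Earlier style: count the vowel-consonant pairs in word[0..len).
   Called afresh for each candidate stem. */
int measure(const char *word, size_t len)
{
    size_t i;
    int m = 0, prev_vowel = 0;
    assert(word != NULL);
    for (i = 0; i < len; i++) {
        int c = is_consonant(word, i);
        if (c && prev_vowel)
            m++;
        prev_vowel = !c;
    }
    return m;
}

/* Later style: one scan per word records where the first and second
   vowel-consonant pairs end. Both default to len if no such pair exists. */
void find_marks(const char *word, size_t len, size_t *p1, size_t *p2)
{
    size_t i;
    int seen = 0, prev_vowel = 0;
    assert(word != NULL);
    *p1 = *p2 = len;
    for (i = 0; i < len; i++) {
        int c = is_consonant(word, i);
        if (c && prev_vowel) {
            if (++seen == 1) {
                *p1 = i + 1;
            } else {
                *p2 = i + 1;
                break;
            }
        }
        prev_vowel = !c;
    }
}
```

With these definitions, for a non-empty ending that leaves a stem of
stem_len letters, measure(word, stem_len) > 0 gives the same answer as
stem_len >= p1, and measure(word, stem_len) > 1 the same answer as
stem_len >= p2. The second form is a single comparison per ending.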

In terms of the whole stemming process, not much time is spent in the
measure function anyway. For the Snowball stemmer, here is a breakdown of
the usage of the 4.80 secs required to stem the sample vocabulary on my
Linux machine:

  measure   0.59
  Step 1a   0.18
  Step 1b   0.31
  Step 1c   0.21
  Step 2    0.43
  Step 3    0.34
  Step 4    0.48
  Step 5a   0.30
  Step 5b   0.05

(The residue, about 1.9 secs, is I/O time.) The measure function takes up
roughly 1/8 of the whole process, or about 1/5 of the time spent stemming.
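
As a rough check on how much an optimised measure function could gain,
the figures above can be fed into a simple bound. The sketch below is only
an illustration: overall_speedup is a hypothetical name, and it assumes
that the 0.59 secs for measure and the 4.80 secs total are the only inputs
that matter and that nothing else changes.

```c
/* Hypothetical helper: overall speedup if the part of the run costing
   'part' seconds (out of 'total' seconds) becomes 'factor' times faster
   and everything else is unchanged. */
double overall_speedup(double total, double part, double factor)
{
    return total / (total - part + part / factor);
}
```

With total = 4.80 and part = 0.59, doubling the speed of measure gives an
overall speedup of about 1.07, and even removing its cost almost entirely
(a very large factor) gives only about 1.14.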

Of course, you might get a different breakdown in Perl (which on my machine
takes 80 secs, incidentally), but this is still a reasonable guide to what
you might expect.

Martin


