[Snowball-discuss] Lingua::Stem

Benjamin Franz snowhare@nihongo.org
Mon Apr 14 22:14:02 2003


I'm the maintainer of the Perl 'Lingua::Stem' module. I've released a new
version that wraps the Snowball-based Perl stemmers (along with some
non-Snowball-based ones) in the standardized Lingua::Stem API.

While looking today for anyone who might be using the Lingua::Stem Perl
module, I noticed that last year it was mentioned as being a poor
performer here on the Snowball-Discuss list. After examining the
benchmark code used (see
<URL:http://www.snowball.tartarus.org/archives/snowball-discuss/0193.html>)
I discovered that the main reason it performed poorly in the tests was
that it was being used in its absolutely lowest performing mode (one word
at a time, no stem caching). I thought it would be worth redoing the
benchmark using its faster modes. So, here is the benchmark code redone to
take full advantage of Lingua::Stem's performance features:

#!/usr/bin/perl

use Benchmark;
use Lingua::Stem qw (:all :caching);

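# Read the vocabulary, one word per line; chomp the whole list at once so
# a final line without a trailing newline is not silently dropped
chomp(my @word = <>);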

#################################################
# Preload word list so we have identical runs
my @word_list = ();
my $s = 100;
for (1..$s) {
  # Pick one random word (scalar element access, not an array slice)
  my $w = $word[rand(scalar(@word))];
  push (@word_list,$w);
}

# Word by word (original benchmark)
# Benchmark object fields: [1] = user CPU, [2] = system CPU, [5] = iterations
my ($n,$pu,$ps) = (0,0,0);
foreach my $w (@word_list) {
  my $result;
  my $t = timeit(2000, sub { ($result) = stem($w) } );
  $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5];
}
printf  "ORIGINAL: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;

# Processed in batch instead of one by one
($n,$pu,$ps) = (0,0,0);   # reset the counters declared above
my $result;
my $t = timeit(2000, sub { ($result) = stem(@word_list) } );
$pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
printf  "BATCHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;

# Processed in batch instead of one by one, with caching turned on
stem_caching({ -level => 2});
($n,$pu,$ps) = (0,0,0);   # reset the counters again
$t = timeit(2000, sub { ($result) = stem(@word_list) } );
$pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
printf  "CACHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;

#################################################################

The results of running it on my home Celeron 500-based Red Hat 7.3
Linux system with Perl 5.6.1, using the voc.txt file, are as follows:


ORIGINAL: Average random cross-sectional stem rate for 100 words:  3718.16 Hz (n=200000).
BATCHED:  Average random cross-sectional stem rate for 100 words: 13097.58 Hz (n=200000).
CACHED:   Average random cross-sectional stem rate for 100 words: 88105.73 Hz (n=200000).
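
For anyone who wants to recheck the ratios between the three modes, here
is a minimal sketch that recomputes them from the rates printed above. The
%rate hash and the speedup() helper are names made up for this
illustration; they are not part of Lingua::Stem or of the benchmark
script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rates in Hz, copied from the three result lines above.
my %rate = (
  ORIGINAL => 3718.16,
  BATCHED  => 13097.58,
  CACHED   => 88105.73,
);

# Speedup of one mode relative to another (hypothetical helper).
sub speedup {
  my ($mode, $baseline) = @_;
  return $rate{$mode} / $rate{$baseline};
}

printf "BATCHED vs ORIGINAL: %.1fx\n", speedup('BATCHED', 'ORIGINAL');
printf "CACHED  vs ORIGINAL: %.1fx\n", speedup('CACHED',  'ORIGINAL');
printf "CACHED  vs BATCHED:  %.1fx\n", speedup('CACHED',  'BATCHED');
```

This prints 3.5x, 23.7x and 6.7x respectively, so most of the gain in the
cached run comes on top of what batching already provides.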

Batching alone gives about a 3.5x improvement. Adding stem caching as well
gives a 23.7x improvement over one-word-at-a-time processing (and, I
judge, leaves even the fastest performers benchmarked last summer
completely in the dust by roughly a factor of 8 to 10). Since I have
wrapped the non-English Snowball stemmers, they should get similar
performance improvements from the stem cache when used via the base
Lingua::Stem modules in the new 0.60 release, provided they are used in a
mode where the caching is significant.
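
To illustrate why batching plus a stem cache pays off when the same words
keep coming back (as they do in the benchmark, which stems the same 100
words 2000 times), here is a minimal sketch of the general idea. It is not
the actual Lingua::Stem implementation: toy_stem() is a hypothetical
stand-in that only strips a trailing "s" or "ing", and cached_stem(),
%cache and $real_calls are names invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %cache;           # word => stem, filled the first time a word is seen
my $real_calls = 0;  # how often the (expensive) stemmer actually ran

# Hypothetical stand-in for a real stemmer: strips a trailing "s" or "ing".
sub toy_stem {
  my ($word) = @_;
  $real_calls++;
  (my $stem = lc $word) =~ s/(?:ing|s)$//;
  return $stem;
}

# Stem a whole list in one call, consulting the cache before stemming.
sub cached_stem {
  my @words = @_;
  my @stems;
  foreach my $word (@words) {
    $cache{$word} = toy_stem($word) unless exists $cache{$word};
    push @stems, $cache{$word};
  }
  return \@stems;
}

my @word_list = qw(cats running cats dogs running);
my $first  = cached_stem(@word_list);  # 3 distinct words => 3 real calls
my $second = cached_stem(@word_list);  # all cache hits   => no new calls
print "@$first\n";
print "real stemmer calls: $real_calls\n";
```

Ten words are requested in total, but the toy stemmer runs only three
times. On input where few words repeat, the cache would help much less.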

-- 
Benjamin Franz

"If the code and the comments disagree, then both are probably wrong."
                                        -- Norm Schryer, Bell Labs