[Snowball-discuss] Lingua::Stem
Oleg Bartunov
oleg@sai.msu.su
Tue Apr 15 03:47:02 2003
Did you try Lingua::Stem::Snowball?
It's not a pure Perl wrapper but uses XS, so it should be much faster.
Also, it doesn't add "additional errors" :-) I recall Martin
has worried about the many errors in self-made stemmers that claim to
be Porter's stemmer.
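
For example, a minimal sketch of its use (going from the interface
documented on CPAN - check the version you have installed):

  use Lingua::Stem::Snowball;

  # Build an English stemmer backed by the compiled Snowball code
  my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

  # Stem a single word, or a whole list via an array reference
  my $stem  = $stemmer->stem( 'horses' );
  my @stems = $stemmer->stem( [ qw( consign consigned consigning ) ] );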
Oleg
On Mon, 14 Apr 2003, Benjamin Franz wrote:
> I'm the maintainer of the Perl 'Lingua::Stem' module. I've released a new
> version that wraps the Snowball-based Perl stemmers (along with some
> non-Snowball-based versions) into the standardized Lingua::Stem API.
>
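> For anyone unfamiliar with the module, basic use of the exported
> procedural interface looks roughly like this (a minimal sketch; see
> the POD for the exact calling and return conventions):
>
> use Lingua::Stem qw (:all :caching);
>
> # Optional: cache stems across calls (benchmarked below)
> stem_caching({ -level => 2 });
>
> # Pass the whole word list at once - this is the fast path
> my @stems = stem( qw( horses consigned meetings ) );
>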
> Something I noticed today while looking for anyone who might be using
> the Lingua::Stem Perl module was that last year it was mentioned as a
> poor performer here on the Snowball-Discuss list. After examining the
> benchmark code used (see
> <URL:http://www.snowball.tartarus.org/archives/snowball-discuss/0193.html>)
> I discovered that the main reason it performed poorly was that it was
> being used in its lowest-performing mode (one word at a time, with no
> stem caching). I thought it would be worth re-doing the benchmark using
> its faster modes. So, here is the benchmark code, redone to take full
> advantage of Lingua::Stem's performance features:
>
> #!/usr/bin/perl
>
> use strict;
> use Benchmark;
> use Lingua::Stem qw (:all :caching);
>
> # Slurp the word list (one word per line), dropping newlines
> my @word = grep chomp, <>;
>
> #################################################
> # Preload the word list so all three runs stem identical words
> my @word_list = ();
> my $s = 100;
> for (1..$s) {
>     push (@word_list, $word[rand @word]);
> }
>
> # Word by word (original benchmark)
> my ($n, $pu, $ps) = (0, 0, 0);
> my $result;
> foreach my $w (@word_list) {
>     my $t = timeit(2000, sub { ($result) = stem($w) } );
>     $pu += $t->[1]; $ps += $t->[2]; $n += $t->[5];
> }
> printf "ORIGINAL: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
>
> # Processed in batch instead of one by one
> ($n, $pu, $ps) = (0, 0, 0);
> my $t = timeit(2000, sub { ($result) = stem(@word_list) } );
> $pu += $t->[1]; $ps += $t->[2]; $n += $t->[5] * $s;
> printf "BATCHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
>
> # Processed in batch, with stem caching turned on
> stem_caching({ -level => 2 });
> ($n, $pu, $ps) = (0, 0, 0);
> $t = timeit(2000, sub { ($result) = stem(@word_list) } );
> $pu += $t->[1]; $ps += $t->[2]; $n += $t->[5] * $s;
> printf "CACHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
>
> #################################################################
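>
> For reference, the stem_caching levels work roughly as follows
> (paraphrasing the module documentation):
>
> stem_caching({ -level => 0 });  # no caching (the default)
> stem_caching({ -level => 1 });  # cache stems within a single stem() call
> stem_caching({ -level => 2 });  # cache stems across calls until cleared
> clear_stem_cache();             # flush the cache explicitly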
>
> The results of running it on my home Celeron 500-based Red Hat 7.3
> Linux system with Perl 5.6.1, using the voc.txt file, are as follows:
>
>
> ORIGINAL: Average random cross-sectional stem rate for 100 words: 3718.16 Hz (n=200000).
> BATCHED: Average random cross-sectional stem rate for 100 words: 13097.58 Hz (n=200000).
> CACHED: Average random cross-sectional stem rate for 100 words: 88105.73 Hz (n=200000).
>
> Batching alone is about a 3.5X improvement. Adding stem caching as well
> gives a 23.7X improvement over one-word-at-a-time processing (and, I
> judge, leaves even the fastest performers benchmarked last summer
> completely in the dust, by roughly a factor of 8 to 10X). Since I have
> wrapped the non-English Snowball stemmers, they should see similar
> performance improvements from the stem cache when used via the base
> Lingua::Stem module in the new 0.60 release, in modes where caching is
> significant.
>
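> To use one of the non-English stemmers, you just select a different
> locale. A minimal sketch with the OO interface (the 'DE' locale and
> the caching call shown here follow the module documentation; the
> German words are only illustrative):
>
> use Lingua::Stem;
>
> my $stemmer = Lingua::Stem->new( -locale => 'DE' );
> $stemmer->stem_caching({ -level => 2 });
>
> my @stems = $stemmer->stem( qw( laufen laufend gelaufen ) );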
>
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83