[Snowball-discuss] Lingua::Stem

Oleg Bartunov oleg@sai.msu.su
Tue Apr 15 04:44:02 2003


On Mon, 14 Apr 2003, Benjamin Franz wrote:

> On Tue, 15 Apr 2003, Oleg Bartunov wrote:
>
> > Did you try Lingua::Stem::Snowball ?
>
> No...It doesn't appear to be available on CPAN. ;)
>

Aha, we just haven't had time to submit it to CPAN :)

> I'm trying to compile the Snowball code from source now, but the
> makefiles appear to be 'fragile' - it is not compiling for me yet.
> Hmmm...You are aware that the module can't be compiled by following the
> directions provided if using the 'porter' (not 'porter2') stemmer?

It's available from the Snowball site, but the latest version can be downloaded
from http://openfts.sourceforge.net/contributions.shtml


>
> Ok. I've got the 'porter2' stemmer installed as 'english'. Benching...
>
> > It's not a pure Perl wrapper but uses XS, so it should be much faster.
> > Also, it doesn't add "additional errors" :-) I recall that Martin
> > was worried about the many errors in home-made stemmers that claim to be
> > Porter's stemmer.
>
> Not bad. Not great, but not bad. Snowball comes in about 2X faster than
> the slowest mode of Lingua::Stem - but substantially slower than either
> the batch or the batch+cache mode of the pure Perl Lingua::Stem.
>
> SNOWBALL: Average random cross-sectional stem rate for 100 words: 7930.21 Hz (n=200000).
> ORIGINAL: Average random cross-sectional stem rate for 100 words: 3644.98 Hz (n=200000).
> BATCHED: Average random cross-sectional stem rate for 100 words: 11848.34 Hz (n=200000).
> CACHED: Average random cross-sectional stem rate for 100 words: 86580.09 Hz (n=200000).
>
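To put the four figures above side by side, here is a minimal sketch that only does the arithmetic on the quoted rates. The numbers are copied from the output above; the helper name relative_to is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rates (stems per CPU second) copied from the benchmark output quoted above.
my %rate = (
    SNOWBALL => 7930.21,
    ORIGINAL => 3644.98,
    BATCHED  => 11848.34,
    CACHED   => 86580.09,
);

# Hypothetical helper: speed of every mode relative to one baseline mode.
sub relative_to {
    my ($baseline) = @_;
    my %ratio;
    $ratio{$_} = $rate{$_} / $rate{$baseline} for keys %rate;
    return %ratio;
}

# Print the modes from slowest to fastest, relative to word-by-word Lingua::Stem.
my %vs_original = relative_to('ORIGINAL');
for my $mode (sort { $rate{$a} <=> $rate{$b} } keys %rate) {
    printf "%-9s %9.2f Hz  %6.2fx ORIGINAL\n",
        $mode, $rate{$mode}, $vs_original{$mode};
}
```

Passing 'SNOWBALL' instead of 'ORIGINAL' gives the same comparison with the XS wrapper as the baseline.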
> I suspect you have underestimated both the performance of well written
> Perl and the 'overhead' of the Perl-XS interface.  Processing words across
> the Perl-XS interface one by one is _EXPENSIVE_ in CPU time.
>
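The per-call cost described above can be illustrated without any XS at all. The sketch below is pure Perl, so it does not measure the Perl-XS boundary itself; it only shows the shape of the two calling styles (one call per word versus one call for the whole list). toy_stem is a hypothetical stand-in, not a real stemmer and not part of Lingua::Stem or Lingua::Stem::Snowball.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical stand-in for a real stemmer: strips one trailing "s".
# It accepts any number of words, so it can be called either way.
sub toy_stem {
    return map { my $w = $_; $w =~ s/s\z//; $w } @_;
}

my @words = qw(cats dogs runs stems words lists);

# One call per word: the call overhead is paid once for every word.
sub one_by_one {
    return map { my ($s) = toy_stem($_); $s } @words;
}

# One call for the whole list: the call overhead is paid once in total.
sub batched {
    return toy_stem(@words);
}

# Both strategies must produce identical stems.
print join(' ', one_by_one()), "\n";
print join(' ', batched()),    "\n";

# Compare throughput; a negative count asks Benchmark to run each
# candidate for at least that many CPU seconds.
cmpthese(-1, { one_by_one => \&one_by_one, batched => \&batched });
```

How large the gap is will depend on the machine and on how much work the stemmer does per word, so treat any numbers it prints as indicative only.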
> ########################################################################################
>
> #!/usr/bin/perl
>
> use Benchmark;
> use Lingua::Stem qw (:all :caching);
> use Lingua::Stem::Snowball;
>
>
> # Read one word per line from STDIN or from the files named on the command line
> chomp(my @word = <>);
>
> #################################################
> # Preload word list so we have identical runs
> my @word_list = ();
> my $s = 100;
> for (1..$s) {
>   # Pick a random word; rand() is truncated to an integer index
>   my $w = $word[rand(scalar(@word))];
>   push (@word_list,$w);
> }
>
> # Word by word using Snowball
> my ($n,$pu,$ps) = (0,0,0);
>
> foreach my $w (@word_list) {
>   my $result;
>   my $t = timeit(2000, sub { ($result) = snowball('english',$w) } );
>   # Benchmark object fields: [1] user CPU time, [2] system CPU time, [5] iterations
>   $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5];
> }
> printf  "SNOWBALL: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
>
> # Word by word (original benchmark)
> ($n,$pu,$ps) = (0,0,0);
> foreach my $w (@word_list) {
>   my $result;
>   my $t = timeit(2000, sub { ($result) = stem($w) } );
>   $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5];
> }
> printf  "ORIGINAL: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
>
> # Processed in batch instead of one by one
> ($n,$pu,$ps) = (0,0,0);
> my $t = timeit(2000, sub { ($result) = stem(@word_list) } );
> $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
> printf  "BATCHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
>
> # Processed in batch instead of one by one, with caching turned on
> stem_caching({ -level => 2});
> ($n,$pu,$ps) = (0,0,0);
> $t = timeit(2000, sub { ($result) = stem(@word_list) } );
> $pu+=$t->[1]; $ps+=$t->[2]; $n+=$t->[5] * $s;
> printf  "CACHED: Average random cross-sectional stem rate for $s words: %5.2f Hz (n=%d).\n", $n/($pu+$ps), $n;
>
>
>
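The script above switches caching on with stem_caching({ -level => 2}). I have not checked how Lingua::Stem implements that cache internally, so the following is only a minimal sketch of the general idea: remember each word's stem in a hash so repeated words never reach the stemmer again. slow_stem and cached_stem are hypothetical names, and slow_stem is a toy, not a real stemming algorithm.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for an expensive stemmer; counts how often it really runs.
my $real_calls = 0;
sub slow_stem {
    my ($word) = @_;
    $real_calls++;
    (my $stem = lc $word) =~ s/s\z//;   # toy rule: drop one trailing "s"
    return $stem;
}

# Cache of word => stem, filled the first time each word is seen.
my %stem_cache;
sub cached_stem {
    my @out;
    for my $word (@_) {
        $stem_cache{$word} = slow_stem($word)
            unless exists $stem_cache{$word};
        push @out, $stem_cache{$word};
    }
    return @out;
}

my @text  = qw(stems words stems lists stems words);
my @stems = cached_stem(@text);

print "@stems\n";
print "words: ", scalar(@text), ", real stemmer calls: $real_calls\n";
```

A cache like this only pays off when words repeat. The benchmark above re-stems the same 100-word list 2000 times, which is the most favourable case for it.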

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83