[Snowball-discuss] Update on regex approach

Allan Fields afieldsml@idirect.ca
Thu, 9 May 2002 14:25:29 -0400


Hi,

Oleg, there's no problem with it.  I appreciate the work that went into the
interfacing code for Perl.  I agree that the C code is likely the fastest
way to do things.  I haven't benchmarked your module yet, but I'm
assuming it will perform quite well.

However, as I said before, it's nice to have the option of a native
implementation as well.  Those who are already using a 100% Perl approach
should note that there are implementation variations that would increase
performance, such as avoiding subroutine call overhead and the variable
scoping costs associated with the garbage collector.  I found a significant
loss in performance when using lexical variables in frequently called
subroutines.

Another note is that regular expressions provide a lot of flexibility for
writing efficient matching code.  My work on the other Perl regular
expression stemmer implementations may be of some academic interest when
they are finished, if nothing else. :)
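
A minimal sketch of the regex-driven idea, for illustration only.  It is
written in Python rather than Perl (re.sub with a callback plays the role of
Perl's s///e), and the names cv_pattern, measure, RULES, SUFFIX_RE and
apply_step are hypothetical, not taken from any stemmer discussed on this
list.  It assumes Porter's definition of the measure m in [C](VC)^m[V] and
uses three sample suffix rules, not a full rule set:

```python
import re

VOWELS = "aeiou"

def cv_pattern(word):
    """Reduce a word to a string of 'c'/'v' marks (the reduced representation).

    Following Porter, 'y' counts as a vowel only when it follows a consonant.
    """
    marks = []
    for i, ch in enumerate(word.lower()):
        if ch in VOWELS:
            marks.append("v")
        elif ch == "y" and i > 0 and marks[-1] == "c":
            marks.append("v")
        else:
            marks.append("c")
    return "".join(marks)

def measure(word):
    """Count the VC sequences in the reduced form, i.e. m in [C](VC)^m[V]."""
    return len(re.findall(r"v+c+", cv_pattern(word)))

# Sample rules only (assumed for this sketch): each fires when measure(stem) > 0.
RULES = {"ational": "ate", "tional": "tion", "izer": "ize"}

# The lazy stem group makes the regex settle on the longest matching suffix.
SUFFIX_RE = re.compile(r"^(.*?)(ational|tional|izer)$")

def apply_step(word):
    """Apply the sample rules in one substitution with inline logic."""
    def repl(match):
        stem, suffix = match.group(1), match.group(2)
        if measure(stem) > 0:
            return stem + RULES[suffix]
        return match.group(0)  # condition failed: leave the word unchanged
    return SUFFIX_RE.sub(repl, word)

if __name__ == "__main__":
    for w in ("relational", "rational", "conditional", "digitizer"):
        print(w, "->", apply_step(w))
```

Here the measure is computed separately from the transform, and the
transform itself is a single substitution whose replacement is decided by
inline logic.  With these sample rules "relational" becomes "relate", while
"rational" is left alone because its stem "r" has measure 0.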

Thanks, Allan


On May 8, 2002 01:33 pm, Oleg Bartunov wrote:
> Allan,
>
> I don't understand what the problem is with using our Perl interface to
> Snowball.  You'll never get better performance than the original C program.
>
>
> 	Oleg
>
> On Wed, 8 May 2002, Allan Fields wrote:
> > Hi,
> >
> > Sorry I haven't dropped by for a while, but I'm quite busy.  I'll try to
> > get my updated Perl stemmer out within the next month.  More
> > benchmarking to come.  =)  The biggest issue is the overhead with multiple
> > words -- Perl can be a real beastie performance-wise, I've found.
> >
> > My other attempt to speed up the Perl stemmer, which I've also been
> > working on, is stuck on a few technical details of the measure of words.
> > One idea I've had is to separate finding the measure from the main
> > transform stage: derive the measure from a reduced set representation,
> > and do the substitution with a single regular expression with supporting
> > inline logic (s///e).  The biggest issue with this approach is that at
> > different points it is necessary to look behind to see if the new measure
> > has changed or is past a minimal boundary.  If there were a way to use
> > integers to represent the logic of the {c, v, C, V} sequences, it might
> > significantly speed up that stage by making the operations integer
> > operations instead.  I would consider this more optimal in that, by
> > accepting larger memory usage (still paltry on today's computers), it
> > would be possible to conserve processor time.
> >
> > Also, by inlining all the logic into a single substitution, it could be
> > said that Perl's larger overhead is reduced somewhat.  Now I'm not sure
> > it would compare to the C version, but I'm postulating it will be
> > significantly faster than most other approaches in Perl.  (Although it
> > won't be as algorithmic, since it moves lots of the procedural elements
> > into the regex itself.)
> >
> > This has led me to believe that it may be possible to create a Snowball
> > compiler that creates stemmers using Perl regexes at most and, at the
> > least, using sed for instance.  There are lots of options for Snowball
> > compilation currently, but it would have a special geek appeal to make
> > this in sed.  Someone, please do beat me to it! ;)
> >
> > Allan
> >
> >
> > _______________________________________________________________
> >
> > Have big pipes? SourceForge.net is looking for download mirrors. We
> > supply the hardware. You get the recognition. Email Us:
> > bandwidth@sourceforge.net
> > _______________________________________________
> > Snowball-discuss mailing list
> > Snowball-discuss@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/snowball-discuss
>
> 	Regards,
> 		Oleg
> _____________________________________________________________
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83

