[Snowball-discuss] Update on regex approach

Allan Fields afieldsml@idirect.ca
Wed, 8 May 2002 12:56:11 -0400


Hi,

Sorry I haven't dropped by for a while, but I've been quite busy.  I'll try
to get my updated Perl stemmer out within the next month.  More benchmarking
to come.  =)  The biggest issue is the overhead of processing multiple
words -- Perl can be a real beastie performance-wise, from what I've witnessed.

My other attempt to speed up the Perl stemmer, which I've also been working
on, is stuck on a few technical details of computing the measure of words.
One idea I've had is to separate finding the measure from the main transform
stage: derive the measure from a reduced-set representation of the word, and
do the transform with a single regular-expression substitution with
supporting inline logic (s///e).  The biggest issue with this approach is
that at different points it is necessary to look behind to see whether the
new measure has changed or is past a minimal boundary.  If there were a way
to use integers to represent the {c, v, C, V} sequences, it might
significantly speed up that stage by making the operations integer
operations instead.  I would consider this more optimal in that, by
accepting larger memory usage (still paltry on today's computers), it would
be possible to conserve processor time.
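
A minimal sketch of the integer idea, written in Python rather than Perl so
it can be checked easily.  It assumes the Porter-style measure m of a word
written as [C](VC)^m[V], and it packs the vowel/consonant pattern into one
integer (bit i set means letter i is a vowel).  The function names and the
bit layout are illustrative assumptions, not taken from any existing
stemmer.

```python
VOWELS = set("aeiou")


def vowel_mask(word):
    """Return an int whose bit i is 1 when letter i of word is a vowel.

    Assumption: Porter's rule for 'y' (a vowel only when it follows a
    consonant).
    """
    bits = 0
    prev_vowel = False
    for i, ch in enumerate(word):
        if ch in VOWELS:
            is_vowel = True
        elif ch == "y":
            is_vowel = i > 0 and not prev_vowel
        else:
            is_vowel = False
        if is_vowel:
            bits |= 1 << i
        prev_vowel = is_vowel
    return bits


def measure(word):
    """Count vowel-to-consonant transitions using integer operations only."""
    n = len(word)
    if n < 2:
        return 0
    bits = vowel_mask(word)
    # Bit i survives when letter i is a vowel and letter i+1 is a consonant.
    # The mask drops the last letter, which has no successor.
    vc = bits & ~(bits >> 1) & ((1 << (n - 1)) - 1)
    return bin(vc).count("1")
```

Once the mask is built, the measure of any prefix can be read off by
masking the same integer, so checking whether the measure is past a minimal
boundary needs no rescan of the string.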

Also, by inlining all the logic into a single substitution, Perl's larger
overhead should be reduced somewhat.  I'm not sure it would compare to the C
version, but I'm postulating it will be significantly faster than most other
approaches in Perl.  (Although it won't read as algorithmically, since lots
of the procedural elements move into the regex itself.)
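
A rough sketch of the single-substitution shape, again in Python, where a
callback passed to re.sub plays the role of the code side of Perl's s///e.
The rule table is a hypothetical three-rule subset chosen for illustration,
and the measure here is simplified (it treats 'y' as a consonant), so this
is not a complete stemmer.

```python
import re

# Hypothetical subset: suffix -> (replacement, measure the stem must exceed).
RULES = {
    "ational": ("ate", 0),
    "ement": ("", 1),
    "eed": ("ee", 0),
}

# One pattern for all suffixes; the lazy stem group makes the longest
# matching suffix win.
PATTERN = re.compile(
    r"^(.*?)(" + "|".join(sorted(RULES, key=len, reverse=True)) + r")$"
)


def measure(stem):
    """Simplified measure: count vowel-run + consonant-run pairs."""
    return len(re.findall(r"[aeiou]+[^aeiou]+", stem))


def _apply(match):
    """Inline logic run per match, like the expression in s///e."""
    stem, suffix = match.group(1), match.group(2)
    replacement, min_measure = RULES[suffix]
    if measure(stem) > min_measure:
        return stem + replacement
    return match.group(0)  # condition failed: leave the word unchanged


def stem_once(word):
    """Apply the single combined substitution to one word."""
    return PATTERN.sub(_apply, word, count=1)
```

The point of the design is that each word costs one regex match plus one
small callback, rather than a chain of separate tests and substitutions.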

This has led me to believe that it may be possible to create a Snowball
compiler that generates stemmers using Perl regexes at most, and at the
least using sed, for instance.  There are lots of options for Snowball
compilation currently, but it would have a special geek appeal to make this
in sed.  Someone, please do beat me to it! ;)

Allan


_______________________________________________
Snowball-discuss mailing list
Snowball-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/snowball-discuss