[Snowball-discuss] Can regular expressions be used to implement Porter Stemmer

Martin Porter martin_porter@softhome.net
Fri Jun 6 10:17:02 2003


No, Snowball doesn't support regexes. The idea of Snowball was to design a
system in sympathy with natural language morphology, rather than build it
out of the standard tools of computer scientists. Furthermore, I can't speak
for all regex interpreters, but you'll find the ANSI C stemmers that come
out of Snowball very much faster than Perl equivalents - and speed is an
important issue in the use of stemmers.

I'm not sure regexes help that much in stemming. An important example of
their misuse is the set of patterns of the form

    ((C)*((V)+(C)+)+(V)*)

in the description of the Porter stemmer in Baeza-Yates and Ribeiro-Neto's
Modern Information Retrieval. They apply these patterns to the whole word
before ending removal, rather than to the stem that remains after the ending
is removed, so in fact they get the description of the stemmer wrong.
Introducing these regexes appears to make the definition more rigorous, but
has the opposite effect.
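
To make the whole-word versus stem distinction concrete, here is a minimal
sketch in Python. It is an illustration only, not Snowball code and not the
book's code. The names cv_shape, measure and ational_to_ate, and the
measure_whole_word flag, are hypothetical and exist only for this example. It
assumes lowercase input and uses the single rule (m>0) ATIONAL -> ATE from
the published Porter algorithm, where m counts the VC groups in [C](VC)^m[V].

```python
import re


def cv_shape(s):
    """Map a lowercase string to its consonant/vowel shape, e.g. 'rel' -> 'CVC'.

    a, e, i, o, u are vowels; y counts as a vowel only when it follows a consonant.
    """
    shape = []
    for i, ch in enumerate(s):
        if ch in "aeiou":
            shape.append("V")
        elif ch == "y" and i > 0 and shape[i - 1] == "C":
            shape.append("V")
        else:
            shape.append("C")
    return "".join(shape)


def measure(s):
    """Porter's measure m: the number of VC groups in the form [C](VC)^m[V]."""
    return len(re.findall(r"V+C+", cv_shape(s)))


def ational_to_ate(word, measure_whole_word=False):
    """Apply the single rule (m>0) ATIONAL -> ATE.

    By default m is measured on the stem left after removing the ending.
    measure_whole_word=True (a hypothetical switch, for contrast only)
    measures the whole word instead, which is the misreading discussed above.
    """
    suffix = "ational"
    if not word.endswith(suffix):
        return word
    stem = word[: -len(suffix)]
    m = measure(word if measure_whole_word else stem)
    return stem + "ate" if m > 0 else word


print(ational_to_ate("relational"))                         # relate   (stem 'rel', m=1)
print(ational_to_ate("rational"))                           # rational (stem 'r', m=0)
print(ational_to_ate("rational", measure_whole_word=True))  # rate     (whole word, m=3)
```

Measured on the stem, "rational" is left alone because "r" has m=0. Measured
on the whole word, the condition m>0 is satisfied and the rule wrongly
produces "rate".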


At 11:26 05/06/2003 +0200, Sven Neumann wrote:
>Something less deep I have been wondering about is if snowball supports
>regular expressions. I went through a lot of trouble of porting Perl5
>compatible regexes for my own programming language, and it seemed the
>easiest way to implement Porter. But snowball doesn't do that either,
>and it's designed for stemmers as such. Is there a specific reason for
>this, and if not, are there samples of a regex porter stemmer?
>
>Thank you,
>
>Sven Neumann
>
>PS: Please do not use html formatted mail, as it mixes badly in the
>replies.