[Snowball-discuss] The Great Stemmer Enumeration Challenge

Wed, 10 Apr 2002 13:44:28 -0500

Forwarding to the list... Modified version of my original email.

----------  Forwarded Message  ----------
From: Allan Fields <afieldscom@idirect.ca>
Subject: The Great Stemmer Enumeration Challenge
Date: Fri, 5 Apr 2002 06:55:38 -0500
To: martin@tartarus.org

Hi,

The following is a list of stemmers I spotted...  It's unbelievably hard to
keep track of them all (-- and these are just the Perl stemmers).  [snip]
I guess it would come as no surprise then, that I was actually planning to
implement YAS (Yet Another Stemmer) to add to the mass.  If possible, I would 
like to avoid this unnecessary branching, and so I'll try to find existing 
implementations to contribute to.  However, with so many to pick from, I've 
been at a loss as to which to use.

So the challenge begins... Join in to The Great Stemmer Enumeration
Challenge!  Come one, come all, bring your Stemmer spottings and a magnifying
glass. [snip - bad joke]

** Perl Stemmers:

1.
- Filename:	perl.txt
- URL:		http://www.tartarus.org/~martin/PorterStemmer/
- Package:	(undef)
- Description:	Martin Porter's official/reference Perl stemmer
- Date:		1990 onward?
- Commentary:	[snip]
- Strength:	Accuracy.

2.
- Filename:	porter.pm
- URL:		http://www.ldc.usb.ve/~vdaniel/porter.pm
- Package:	(undef)
- Description:	Daniel van Balen's Perl version
- Date:		October-1999
- Commentary:	Conditionals for speed improvement, lots of repetition.
- Strength:	Speed?

3.
- Filename:	stem.pl
- URL:		http://www.cpan.org
- Package:	(undef)
- Description:	Ian Phillipps' WAIS stemmer.c derivative.
- Date:		?
- Commentary:	.
- Strength:	Simplicity, Flowingness.

4.
- Filename:	English.pm
- URL:		http://www.cpan.org
- Package:	Text::English
- Description:	Modularized version of stem.pl by Ulrich Pfeifer.
- Date:		Thu Feb  1 13:47:58 1996
- Commentary:	Bad placement -- belongs in Lingua::En on CPAN
- Strength:	Simple.  It actually has a package name and is CPAN-friendly.

5.
- Filename:	Stem.pm, En.pm
- URL:		http://www.cpan.org
- Package:	Lingua::Stem
- Description:	A more complete approach to Perl stemming.  May have been
branched off of Text::English then moved to Text::Stem before being moved to
Lingua::Stem. Jim Richardson, University of Sydney <imr@maths.usyd.edu.au>
and Benjamin Franz <snowhare@nihongo.org>.
- Date:		1999, 2000 fixed missing rules
- Commentary:	Uses strange symbolic reference subroutine calls.  Is this
really necessary?  (Why?)  Assumes US English?
- Strength:	Caches results, allows exceptions, OO interface.  Multiple
languages may be supported in future.

6.
- Filename:	?
- URL:		http://www.cpan.org
- Package:	ROADS::Porter
- Description:	A class to perform stemming using the Porter algorithm. (UK -
eLib/ROADS/DESIRE Library Project)
- Date:		1988??
- Commentary:	Haven't even looked yet.
- Strength:	Hell if I know. :)

7.
- Filename:	perl.tgz
- URL:		http://snowball.sourceforge.net
- Package:	Lingua::Stem::Snowball
- Description:	A Perl wrapper for the Snowball stemmer
- Date:		2002
- Commentary:	Haven't tested yet.
- Strength:	Probably a more logical approach than porting the stemmer 
directly to Perl.  Uses XS?  The performance gains may be significant.

8.
- Filename:	Stemmer.pm
- URL:		Not yet released
- Package:	Yet another stemmer, not decided
- Description:	My attempt at implementing the Porter Stemming algorithm.. YET
AGAIN for my first time :)...  Also including other custom features for
different types of word stemming.
- Date:		2002
- Commentary:	Oh no! Not another one...
- Strength:	I know the author.  The author might attempt to make it into a GUR 
(Grand Unified Regex) in Perl if that is even possible, just for the sake of 
obfuscation.  Or I'll leave that to japhy.  (hehe... He'll do anything if it 
involves a regex challenge.)

** Other:

1.
- URL: http://www.cogsci.princeton.edu/~wn/
- Commentary: What about Wordnet at Princeton?  Do they use it too in their
morphy thing?  Is wordnet cool or what? :)

How does Wordnet fit into the idea of dictionary-based stemming?  It might 
make sense to supplement the stemmer with Wordnet-like lexical information 
derived from current-day English usage (and/or multiple dialects) to avoid 
mis-stemming or over-stemming.  Martin, your paper covers this idea 
thoroughly in section 3 "Stemming errors, and the use of dictionaries".  It 
makes clear that no algorithm is going to be perfect.  I was going to raise 
that point in another email with regard to all the -ing exceptions: it 
doesn't seem like a purely algorithmic stemmer can tackle those without an 
exception list.

Has anyone created a comprehensive exceptions dictionary for stemming, or a 
starting point for one that could perhaps be interfaced with the stemmer?  If 
not, perhaps the fellows at Princeton would be interested in integrating this 
with Wordnet somehow.
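
To make the idea concrete, here is a minimal Python sketch of an exceptions
dictionary sitting in front of an algorithmic stemmer.  Everything in it is an
assumption for illustration: naive_stem is a toy suffix stripper (it is not
the Porter algorithm), and the entries in EXCEPTIONS are just examples of words
that such rules would mangle.

```python
# Toy suffix stripper standing in for a real algorithmic stemmer.
# NOTE: this is NOT the Porter algorithm; the rules are illustrative only.
def naive_stem(word):
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical exceptions dictionary: word -> stem to use instead of the rules.
EXCEPTIONS = {
    "herring": "herring",
    "ceiling": "ceiling",
    "sibling": "sibling",
    "news": "news",
}


def stem(word, exceptions=EXCEPTIONS):
    """Consult the exceptions dictionary first, then fall back to the rules."""
    word = word.lower()
    if word in exceptions:
        return exceptions[word]
    return naive_stem(word)
```

With the rules alone, naive_stem("herring") gives "herr"; with the lookup in
front, stem("herring") leaves the word intact while stem("walking") still
gives "walk".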

Another idea is to simply create a new database structure that records the 
relational aspects of words and particles, giving a list of capabilities or 
composition rules.  It would make sense to tie this into all Snowball target 
languages, whenever possible.  The Perl Lingua::Stem module currently 
maintains its own configurable exception list, for instance.  If this were 
centralized into a proper database (Berkeley DB probably) that integrated 
exceptions/relations, it would improve overall cross-platform accessibility.  
But this may not be directly the responsibility of the stemmer?
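
A rough sketch of what a shared, on-disk exception store could look like,
using Python's standard dbm module as a stand-in for Berkeley DB (dbm picks
whichever key-value backend is available, so this is an assumption about the
storage layer, not a claim about how any existing stemmer does it).  The
function names and the file name are made up for the example.

```python
import dbm
import os
import tempfile


def save_exceptions(path, exceptions):
    """Write a {word: stem} mapping into a dbm file (created if missing)."""
    with dbm.open(path, "c") as db:
        for word, stem in exceptions.items():
            # dbm stores bytes, so encode both keys and values.
            db[word.encode("utf-8")] = stem.encode("utf-8")


def lookup_exception(path, word):
    """Return the stored stem for word, or None if it has no entry."""
    with dbm.open(path, "r") as db:
        value = db.get(word.encode("utf-8"))
    return value.decode("utf-8") if value is not None else None


# Usage: build a throwaway database and query it.
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "stem_exceptions")  # hypothetical file name
    save_exceptions(db_path, {"herring": "herring", "news": "news"})
    found = lookup_exception(db_path, "herring")
    missing = lookup_exception(db_path, "walking")
```

A stemmer in any target language could consult such a file before applying
its rules, provided all of them agree on the key and value encoding.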

With the stemmer so far, the principle has been to stay as close as possible 
to simple (dictionary) common words, and to avoid chopping off too much or 
too little in the process of coming up with a stem.  But maybe in the future 
this can be presented as an option that allows a gradient of decomposition.  
At some point a stemmer could potentially cross over into the role of a word 
root finder, but that would require prefix stripping, which wouldn't 
necessarily fit into the idea of a stemmer.  Then it gets into the 
etymological scope as is mentioned in the paper.
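
One way to picture such a gradient is a level parameter that switches on
progressively more aggressive rule sets.  The Python sketch below is only a
toy under stated assumptions: the rule lists, the level numbering and the
function names are invented for illustration and are not taken from the
Porter algorithm or from Snowball.

```python
# Hypothetical rule sets; each level adds one more on top of the previous.
INFLECTIONAL = ("ing", "ed", "s")        # level 1
DERIVATIONAL = ("ation", "ness", "ment")  # level 2
PREFIXES = ("un", "re", "dis")            # level 3 (root finding, not stemming)


def strip_suffix(word, suffixes, min_stem=3):
    """Remove the first matching suffix if enough letters would remain."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def strip_prefix(word, prefixes, min_stem=3):
    """Remove the first matching prefix if enough letters would remain."""
    for prefix in prefixes:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            return word[len(prefix):]
    return word


def decompose(word, level=1):
    """Level 0 returns the word unchanged; higher levels strip more."""
    word = word.lower()
    if level >= 1:
        word = strip_suffix(word, INFLECTIONAL)
    if level >= 2:
        word = strip_suffix(word, DERIVATIONAL)
    if level >= 3:
        word = strip_prefix(word, PREFIXES)
    return word
```

For "disagreements", levels 0 through 3 give "disagreements",
"disagreement", "disagree" and "agree" respectively, which shows where the
stemmer stops and the root finder takes over.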

Also, I acknowledge the imperfections of stemming and, for that matter, of 
all other forms of natural language processing.  The language just wasn't 
designed to be easily processed by computers; after all, the whole idea of 
language in computer applications is relatively new (in the grand scheme of 
things).  =)

-- Allan Fields
