[Snowball-discuss] English stemming: -ise vs. -ize (revisited)

Fri Mar 19 09:39:14 GMT 2010

Hi everyone,

While I'm new here, I've been using an implementation of Porter 2 for the open source CMS/CMF Drupal (http://drupal.org, http://drupal.org/project/porterstemmer) for a while.

I was going to ask a question about inconsistent treatment of -ise and -ize but found this useful thread in the November 2008 archives:
http://lists.tartarus.org/mailman/private/snowball-discuss/2008-November/001063.html

So instead here are a few rambling thoughts from a beginner in this field. My interest is in making the search function on the 1 or 2 websites I run work a bit better.

I was looking at this excerpt from the sample vocab and its stemmed equivalent:
apologetic      -> apologet
apologetically  -> apologet
apologies       -> apolog
apologise       -> apologis
apologised      -> apologis
apologising     -> apologis
apologists      -> apologist
apologize       -> apolog
apologized      -> apolog
apologizes      -> apolog
apologizing     -> apolog
apology         -> apolog

In this case one might want apologis to reduce to apolog ... Also one might sometimes want apologist or apologet to be "connected"/related somehow to the stem apolog...
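
To make the split easier to see, here is a small Python sketch that groups the excerpt by stem. It does not run a stemmer: the word/stem pairs are copied by hand from the listing above, and the names SAMPLE and group_by_stem are just mine for illustration.

```python
from collections import defaultdict

# Word -> stem pairs copied by hand from the sample vocabulary excerpt.
# No stemmer is invoked here; this only regroups the published output.
SAMPLE = {
    "apologetic": "apologet",
    "apologetically": "apologet",
    "apologies": "apolog",
    "apologise": "apologis",
    "apologised": "apologis",
    "apologising": "apologis",
    "apologists": "apologist",
    "apologize": "apolog",
    "apologized": "apolog",
    "apologizes": "apolog",
    "apologizing": "apolog",
    "apology": "apolog",
}


def group_by_stem(pairs):
    """Return a dict mapping each stem to the list of words that produce it."""
    groups = defaultdict(list)
    for word, stem in pairs.items():
        groups[stem].append(word)
    return dict(groups)


groups = group_by_stem(SAMPLE)
for stem in sorted(groups):
    print(f"{stem:10} {', '.join(groups[stem])}")
```

The -ise and -ize spellings end up in two separate groups (apologis and apolog), so a search for one spelling would miss documents that use the other.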

Having slept on all of this for a bit, it seems to me that going any further than the Porter 2 algorithm currently does would mean at least an order of magnitude increase in its complexity. Correctly connecting apolog with apologis would probably require a dictionary/lookup approach (essentially detailing when -ise *can*, or cannot, be completely stripped). Relating apologist or apologet to apolog (while allowing that they are not as close a match as the words that stem to apolog) would require another order of magnitude increase in complexity, or perhaps two. Firstly we'd need more dictionary lookups. Then, when doing a search (e.g. for apologizing), we'd need to know that we had matched e.g. apologet(ic) rather than apolog, in order to score that match lower than a match on another word that stems to apolog.
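
As a rough illustration of the two layers described above, here is a minimal Python sketch that works on stems after Porter 2 has run. Everything in it is an assumption of mine rather than part of Snowball: the two tables (ISE_EXCEPTIONS and RELATED) hold a single hand-written example each, the weight 0.5 is arbitrary, and the function names are hypothetical. A real version would need far larger tables, which is where the extra complexity comes in.

```python
# Hypothetical exception table: stems whose trailing "is" is only the
# British spelling of -ize, so it is safe to map them onto the -ize stem.
ISE_EXCEPTIONS = {
    "apologis": "apolog",
}

# Hypothetical relation table: stem -> (related stem, weight).
# The weight (arbitrary here) says how much weaker the match is than
# an exact stem match, which scores 1.0.
RELATED = {
    "apologist": ("apolog", 0.5),
    "apologet": ("apolog", 0.5),
}


def normalise(stem):
    """Map a stem through the -ise exception table, if it is listed."""
    return ISE_EXCEPTIONS.get(stem, stem)


def match_score(query_stem, doc_stem):
    """Score a query stem against a document stem.

    1.0 for the same stem after -ise normalisation, the table weight
    for a listed related stem, and 0.0 otherwise.
    """
    q = normalise(query_stem)
    d = normalise(doc_stem)
    if q == d:
        return 1.0
    # Check the relation table in both directions.
    for a, b in ((q, d), (d, q)):
        related = RELATED.get(a)
        if related is not None and related[0] == b:
            return related[1]
    return 0.0


# A search for "apologizing" stems to "apolog" in the sample vocabulary.
for doc_stem in ("apolog", "apologis", "apologet", "apologist"):
    print(doc_stem, match_score("apolog", doc_stem))
```

With these tables, apologis scores the same as apolog, while apologet and apologist match with a lower score, which is the ranking behaviour described above.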

Is this of relevance to an open source approach to stemming/IR? Well, computing power has increased somewhat since the original Porter algorithm, which will be celebrating its 30th birthday this year, so more is certainly possible. And I imagine that the likes of Google are doing all this, and more. Except that of course what they are doing is closed source (I wonder if they started from Porter..!!).

So for the likes of me, people who just want search to work or work better on their websites, would it be fair to say there's quite a lot to be done to advance the (open source) state of the art beyond where we are at present? Or is work already ongoing in projects like Lucene (which, as I understand it, already includes Porter)?

Belatedly, I should here thank Martin and everyone who has helped make the Porter 2 algorithm as effective as it already is. In fact it completely transforms the built-in site search on the sites in question. And downloading and installing a Drupal module is something I can cope with; Lucene/ApacheSolr is probably rather beyond my capabilities..! It's all too easy to want everything for no "cost"..

Best,
Giles Kennedy