[Snowball-discuss] Recognise/ize stemming inconsistency?

Martin Porter martin at porterloo.wanadoo.co.uk
Fri Nov 14 16:47:24 GMT 2008



Piers,

This "ise/ize" debate often re-emerges. The essential point is that treating
"ise" as a removable ending does too much damage to the many words that end
in "ise" but in which "ise" is not a suffix: enfranchise, otherwise,
paradise, imprecise and so on. Here is an answer I sent to André MacQuaid on
22 Feb 2001.

------------------------------------------

Re: Stemming American English vs. English


Dear André,

I don't think you need worry too much about English/American spelling
differences, as far as the Porter stemming algorithm is concerned. The main
difference is that -ize and -ise endings are (as you note) applied
differently in American and English usage, and the algorithm treats -ize as
an ending but not -ise.

Many people have adapted the algorithm by adding -ise to the list of
endings, but on balance I think that is a mistake. There are too many words
ending -ise where -ise should not be removed.
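
To make the trade-off concrete, here is a minimal sketch in Python. It is a
toy single-suffix stripper, not the Porter or Porter2 algorithm, and the
names strip_suffix, min_stem and damage_report are hypothetical, chosen only
for this illustration. It shows that once -ise is added to the list of
endings, genuine suffix cases and non-suffix cases are cut in the same way.

```python
# Toy illustration only: this is NOT the Porter or Porter2 algorithm.
# It removes one suffix from a word when enough letters remain as a stem.

def strip_suffix(word, suffixes, min_stem=3):
    """Remove the first matching suffix, keeping at least min_stem letters.

    min_stem is an assumed threshold for this sketch, not a Porter rule.
    """
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# British spellings where -ise is a genuine suffix.
GENUINE = ["recognise", "apologise", "organise"]
# Words ending in -ise where -ise is not a suffix.
NOT_SUFFIX = ["enfranchise", "otherwise", "paradise", "imprecise"]


def damage_report(words, suffixes):
    """Map each word to what is left after naive suffix stripping."""
    return {word: strip_suffix(word, suffixes) for word in words}


if __name__ == "__main__":
    for word, stem in damage_report(GENUINE + NOT_SUFFIX, ["ise"]).items():
        print(f"{word:12} -> {stem}")
```

With -ise in the list, "recognise" is cut to "recogn" as hoped, but
"paradise" becomes "parad" and "otherwise" becomes "otherw". A real stemmer
applies further conditions before removing an ending, so the exact stems
would differ, but the words affected are the same.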

American spelling is much more logical than English, and -ize/-ise usage is
no exception. So in fact the Porter stemmer probably does better with
American English than with English English!

As a matter of fact -ize usage in England used to be much closer to the
American style than it now is. Here are Thackeray's -ize endings from Vanity
Fair (published 1847):

agonized
apologize apologized
authorized
capitalized
characterize
cicatrized
civilized
harmonized
idolizes
particularize
patronize patronized patronizes
proselytizer
realize realized
recognize recognized
tyrannize tyrannized
victimized victimizer

Today many of these words would have to be spelled -ise in England, e.g.
characterise, realise, recognise ....

Hope this helps,

Martin





At 14:04 14/11/2008 +0000, Piers Taylor wrote:
>Dear Martin,
>
>I am working on a PHP version of your Porter2 Stemmer.
>
>I came across the following results and then checked
>the Diffs file and found them to be as per your code:
>
>	recognise	recognis
>	recognised	recognis
>	recognising	recognis
>	recognition	recognit
>	recognize	recogn
>	recognized	recogn
>	recognizes	recogn
>	recognizing	recogn
>
>Since the word:
>
>	recognise == recognize
>
>and so on, I would expect them to stem to the same
>thing, since an index of mixed UK/American documents
>might well contain either spelling.
>
>Likewise, I would expect:
>
>	recognition -> recogn OR recognis
>
>I also note similar queries with the following words:
>
>	apologise, criticise, organise, patronise, sympathise,
>	scrutinising, tantalising
>
>There may be others, but these were highlighted by
>some unit tests I am working on.
>
>I understand that stemming is not a totally
>exact science, and would welcome your comments
>on the above observations.
>
>With best regards,
>		   Piers
>
>Piers Taylor
>01752 822572
>07815 155301
>piers-taylor at 2vu.com
>
>




