[Snowball-discuss] Recognise/ize stemming inconsistency?

Piers Taylor piers-taylor at 2vu.com
Mon Nov 17 15:28:17 GMT 2008


Dear Martin,

Thank you for your reply. My experiments
with the code led me to a similar conclusion
re 'damage'.

Your comments are most helpful and I was sure
you would be able to put my mind at rest.

Thanks again!

Best regards,
		Piers

Piers Taylor
01752 822572
07815 155301
piers-taylor at 2vu.com



On 14 Nov 2008, at 16:47, Martin Porter wrote:

>
>
> Piers,
>
> This "ise/ize" debate often re-emerges. The essential point is that "ise" as
> an included ending does too much damage to the many words ending "ise", but
> for which "ise" is not a suffix: enfranchise, otherwise, paradise, imprecise
> and so on. Here is an answer I sent to Andre MacQuaid on 22 Feb 2001.
>
> ------------------------------------------
>
> Re: Stemming American English vs. English
>
>
> Dear André,
>
> I don't think you need worry too much about English/American spelling
> differences, as far as the Porter stemming algorithm is concerned. The main
> difference is that -ize and -ise endings are (as you note) applied
> differently in American and English usage, and the algorithm treats -ize as
> an ending but not -ise.
>
> Many people have adapted the algorithm by adding -ise to the list of
> endings, but on balance I think that is a mistake. There are too many words
> ending -ise where -ise should not be removed.
>
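To make the point above concrete, here is a minimal Python sketch. It is not
the Porter rule itself: the helper name strip_ise_naively is made up for this
illustration, and it ignores the conditions the real algorithm applies before
removing an ending. It only shows what unconditional removal of "ise" does to
the words listed earlier.

```python
def strip_ise_naively(word: str) -> str:
    """Remove a trailing "ise" with no further checks (illustration only)."""
    return word[:-3] if word.endswith("ise") else word


# Words quoted in the message where "ise" is not a suffix.
not_a_suffix = ["enfranchise", "otherwise", "paradise", "imprecise"]
# A word where "ise" really is a suffix.
genuine_suffix = ["recognise"]

for w in not_a_suffix + genuine_suffix:
    print(f"{w:12} -> {strip_ise_naively(w)}")
```

The first four words lose part of their root, while only "recognise" is
reduced in the way one would want.
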
> American spelling is much more logical than English, and -ize/-ise usage is
> no exception. So in fact the Porter stemmer probably does better with
> American English than with English English!
>
> As a matter of fact -ize usage in England used to be much closer to the
> American style than it now is. Here are Thackeray's -ize endings from Vanity
> Fair (published 1847):
>
> agonized
> apologize apologized
> authorized
> capitalized
> characterize
> cicatrized
> civilized
> harmonized
> idolizes
> particularize
> patronize patronized patronizes
> proselytizer
> realize realized
> recognize recognized
> tyrannize tyrannized
> victimized victimizer
>
> Today many of these words would have to be spelled -ise in England, e.g.
> characterise, realise, recognise ....
>
> Hope this helps,
>
> Martin
>
>
>
>
>
> At 14:04 14/11/2008 +0000, Piers Taylor wrote:
>> Dear Martin,
>>
>> I am working on a PHP version of your Porter2 Stemmer.
>>
>> I came across the following results and then checked
>> the Diffs file and found them to be as per your code:
>>
>> 	recognise	recognis
>> 	recognised	recognis
>> 	recognising	recognis
>> 	recognition	recognit
>> 	recognize	recogn
>> 	recognized	recogn
>> 	recognizes	recogn
>> 	recognizing	recogn
>>
>> Since the word:
>>
>> 	recognise == recognize
>>
>> and so on, I would therefore expect them to stem
>> to the same thing, since indexing mixed UK/American
>> documents might well contain either.
>>
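One way to get both spellings to the same stem without changing the stemmer
is to normalise known -ise words to -ize before stemming. The Python sketch
below is only an illustration under assumptions: the names IZE_ROOTS, ENDINGS
and to_ize are hypothetical, and the root list is just the words mentioned in
this thread, not a complete dictionary. It does nothing for "recognition".

```python
# Hypothetical allow-list of roots whose -ise spelling is a variant of -ize.
# Taken only from the words mentioned in this thread.
IZE_ROOTS = {
    "apolog", "critic", "organ", "patron",
    "recogn", "scrutin", "sympath", "tantal",
}

# British ending -> American ending.
ENDINGS = [("ising", "izing"), ("ised", "ized"), ("ises", "izes"), ("ise", "ize")]


def to_ize(word: str) -> str:
    """Rewrite -ise forms to -ize forms, but only for allow-listed roots."""
    lower = word.lower()
    for british, american in ENDINGS:
        root = lower[: -len(british)]
        if lower.endswith(british) and root in IZE_ROOTS:
            return root + american
    return lower  # anything not on the list is left alone


for w in ["recognise", "recognised", "scrutinising", "paradise", "promised"]:
    print(f"{w:12} -> {to_ize(w)}")
```

Because only allow-listed roots are rewritten, words such as "paradise" and
"promised" pass through unchanged; the cost is that the list has to be
maintained by hand.
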
>> Likewise, I would expect:
>>
>> 	recognition -> recogn OR recognis
>>
>> I also note similar queries with the following words:
>>
>> 	apologise, criticise, organise, patronise, sympathise,
>> 	scrutinising, tantalising
>>
>> There may be others, but these were highlighted by
>> some unit tests I am working on.
>>
>> I understand that stemming is not a totally
>> exact science, and would welcome your comments
>> on the above observations.
>>
>> With best regards,
>> 		   Piers
>>
>> Piers Taylor
>> 01752 822572
>> 07815 155301
>> piers-taylor at 2vu.com
>>
>>
>
>



