[Snowball-discuss] Recognise/ize stemming inconsistency?
Piers Taylor
piers-taylor at 2vu.com
Mon Nov 17 15:28:17 GMT 2008
Dear Martin,
Thank you for your reply. My experiments
with the code led me to a similar conclusion
re 'damage'.
Your comments are most helpful and I was sure
you would be able to put my mind at rest.
Thanks again!
Best regards,
Piers
Piers Taylor
01752 822572
07815 155301
piers-taylor at 2vu.com
On 14 Nov 2008, at 16:47, Martin Porter wrote:
>
>
> Piers,
>
> This "ise/ize" debate often re-emerges. The essential point is that "ise" as
> an included ending does too much damage to the many words ending "ise", but
> for which "ise" is not a suffix: enfranchise, otherwise, paradise, imprecise
> and so on. Here is an answer I sent to Andre MacQuaid on 22 Feb 2001.
>
> ------------------------------------------
>
> Re: Stemming American English vs. English
>
>
> Dear André,
>
> I don't think you need worry too much about English/American spelling
> differences, as far as the Porter stemming algorithm is concerned. The main
> difference is that -ize and -ise endings are (as you note) applied
> differently in American and English usage, and the algorithm treats -ize as
> an ending but not -ise.
>
> Many people have adapted the algorithm by adding -ise to the list of
> endings, but on balance I think that is a mistake. There are too many words
> ending -ise where -ise should not be removed.
>
> American spelling is much more logical than English, and -ize/-ise usage is
> no exception. So in fact the Porter stemmer probably does better with
> American English than with English English!
>
> As a matter of fact -ize usage in England used to be much closer to the
> American style than it now is. Here are Thackeray's -ize endings from Vanity
> Fair (published 1847):
>
> agonized
> apologize apologized
> authorized
> capitalized
> characterize
> cicatrized
> civilized
> harmonized
> idolizes
> particularize
> patronize patronized patronizes
> proselytizer
> realize realized
> recognize recognized
> tyrannize tyrannized
> victimized victimizer
>
> Today many of these words would have to be spelled -ise in England, e.g.
> characterise, realise, recognise ....
>
> Hope this helps,
>
> Martin
>
>
>
>
>
> At 14:04 14/11/2008 +0000, Piers Taylor wrote:
>> Dear Martin,
>>
>> I am working on a PHP version of your Porter2 Stemmer.
>>
>> I came across the following results and then checked
>> the Diffs file and found them to be as per your code:
>>
>> recognise     recognis
>> recognised    recognis
>> recognising   recognis
>> recognition   recognit
>> recognize     recogn
>> recognized    recogn
>> recognizes    recogn
>> recognizing   recogn
>>
>> Since the word:
>>
>> recognise == recognize
>>
>> and so on, I would therefore expect them to stem
>> to the same thing, since indexing mixed UK/American
>> documents might well contain either.
>>
>> Likewise, I would expect:
>>
>> recognition -> recogn OR recognis
>>
>> I also note similar queries with the following words:
>>
>> apologise, criticise, organise, patronise, sympathise,
>> scrutinising, tantalising
>>
>> There may be others, but these were highlighted by
>> some unit tests I am working on.
>>
>> I understand that stemming is not a totally
>> exact science, and would welcome your comments
>> on the above observations.
>>
>> With best regards,
>> Piers
>>
>> Piers Taylor
>> 01752 822572
>> 07815 155301
>> piers-taylor at 2vu.com
>>
>>
>
>
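
To illustrate the point made in the quoted reply, here is a small sketch of what
happens when -ise is stripped alongside -ize. It is a toy suffix stripper only,
not the Porter or Porter2 algorithm; the function names, the minimum stem length
of 3 and the word list are assumptions made for the example.

```python
# Toy illustration only: this is NOT the Porter or Porter2 algorithm.
# It simply strips one suffix so the effect of adding "ise" can be seen.

def strip_suffix(word, suffixes, min_stem=3):
    """Remove the first matching suffix, keeping at least min_stem letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def compare(words):
    """Return (word, stem with -ize only, stem with -ize and -ise) rows."""
    return [
        (w, strip_suffix(w, ("ize",)), strip_suffix(w, ("ize", "ise")))
        for w in words
    ]


# The last four words are the counter-examples named in the quoted reply.
WORDS = ["recognize", "recognise", "enfranchise", "otherwise",
         "paradise", "imprecise"]

for word, ize_only, ize_and_ise in compare(WORDS):
    print(f"{word:12} {ize_only:12} {ize_and_ise}")
```

With "ise" added, recognise and recognize do collapse to the same string, but
enfranchise, otherwise, paradise and imprecise are truncated as well, which is
the damage the reply describes.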