[Snowball-discuss] porter2 stemmer overstemming the letter e

Vincent Li vincent.li at formicary.net
Mon Nov 24 21:12:50 GMT 2008


Hi Martin and Piers, thank you for your responses; they were both very
helpful.

Interestingly, I had a look at the WordNet synonyms and found that some
were word combinations ('common room', 'six pack') and some contained
hyphens ('semi-detached house').

I guess I'll have to drop all the entries containing non-alpha characters
and apply stemming to the remaining ones for now. Just thought I'd share
this with you, although I can't help feeling this is a waste of WordNet's
entries, as they were put in by hand. I might try to tokenize and stem them
later on, but that will require a fast way of matching multiple tokens
against the query terms. Sounds messy. :)
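
A rough sketch of how the multi-token matching might look, in Python. It
follows Piers's suggestion of stemming the synonyms before matching, so that
both sides are compared in stemmed form. Everything here is an assumption
for illustration: toy_stem is only a placeholder (it is NOT Porter2 and
should be swapped for a real Snowball stemmer), SAMPLE_GROUPS is made-up
data rather than real WordNet output, and the function names are
hypothetical.

```python
import re
from typing import Callable, Dict, Iterable, List, Set, Tuple

Stemmer = Callable[[str], str]
Phrase = Tuple[str, ...]


def toy_stem(word: str) -> str:
    # Placeholder only, NOT Porter2: lowercases and drops one trailing 'e'.
    # Replace with a real Snowball English stemmer in practice.
    word = word.lower()
    if len(word) > 3 and word.endswith("e"):
        return word[:-1]
    return word


def stem_entry(entry: str, stem: Stemmer) -> Phrase:
    # Split on anything that is not a letter, so 'semi-detached house'
    # becomes three tokens, then stem each token.
    tokens = re.findall(r"[A-Za-z]+", entry)
    return tuple(stem(t) for t in tokens)


def build_index(groups: Iterable[List[str]], stem: Stemmer) -> Dict[Phrase, Set[Phrase]]:
    # Map each stemmed entry to the other stemmed entries in its group.
    index: Dict[Phrase, Set[Phrase]] = {}
    for group in groups:
        stemmed = {stem_entry(e, stem) for e in group}
        stemmed.discard(())  # skip entries with no letters at all
        for key in stemmed:
            index.setdefault(key, set()).update(stemmed - {key})
    return index


def expand_query(query: str, index: Dict[Phrase, Set[Phrase]],
                 stem: Stemmer, max_len: int = 3) -> Set[Phrase]:
    # Look up every window of 1..max_len consecutive stemmed query tokens.
    stems = stem_entry(query, stem)
    found: Set[Phrase] = set()
    for start in range(len(stems)):
        for length in range(1, max_len + 1):
            window = stems[start:start + length]
            if len(window) < length:
                break  # ran off the end of the query
            found |= index.get(window, set())
    return found


# Made-up synonym groups for illustration, not real WordNet data.
SAMPLE_GROUPS = [
    ["vintage", "time-honoured"],
    ["common room", "lounge"],
]
```

Because the index keys are tuples of stems, each window is a single
dictionary lookup, so the cost per query is roughly the number of query
tokens times max_len. Whether that is fast enough would depend on the
query lengths involved.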

Kind regards,
Vincent



> Hi Vincent,
>
>  >It would be great for synonym searching if there
>  >was some general rule for putting the letter 'e'
>  >back in to some of these words.
>
> um, I may be missing something here, but shouldn't
> you be stemming the synonym and *then* doing the search?
> That way your synonym will match the index.
> That's the way I would be doing it.
>
> With best regards,
> 		  Piers
>
> Piers Taylor
> 01752 822572
> 07815 155301
> piers-taylor at 2vu.com
>
>
>
> On 24 Nov 2008, at 01:13, Vincent Li wrote:
>
>> Hi there, I have a quick question about the Porter2 stemmer overstemming
>> the letter 'e' at the end of English words. At a glance, this appears to
>> be quite common, as I noticed two from the sample vocab on
>>
>> http://snowball.tartarus.org/algorithms/english/stemmer.html
>>
>> console -> consol
>> conspire -> conspir
>>
>> vintage -> vintag
>>
>>
>> I won't be surprised if there is something I am missing here, and would be
>> glad if someone can enlighten me as to why the stemmer does this.
>>
>> I discovered this while I was trying to inject WordNet synonyms into
>> stemmed search queries and noticed the search didn't pick up any synonyms
>> for vintage. I thought about adding this as an exception, but I noticed
>> the two entries in the sample vocabulary on the English stemmer site and
>> thought it might be a common thing.
>>
>> Just here to check if this is more of a feature than a bug really. It
>> would be great for synonym searching if there was some general rule for
>> putting the letter 'e' back in to some of these words. :)
>>
>> Many thanks in advance,
>>
>> Vincent
>>
>> P.S. Is there a way to search through the archive of this email list?
>> Apologies if this question was addressed before, I tried but failed to
>> find a search.
>>
>>
>> ----------------------------------------------------------------------------
>> This message is confidential and may be privileged. It is intended
>> solely for
>> the named addressee. If you are not the intended recipient, please
>> inform us.
>> Any unauthorised dissemination, distribution or copying hereof is
>> prohibited.
>>
>> Formicary Limited registered office in England and Wales, address 1
>> Taillar
>> Road, Hedon, East Yorkshire HU12 8GU, registration number 3894343,
>> VAT number
>> 747644304, does not guarantee that the integrity of this
>> communication has been
>> maintained nor that this communication is free of viruses,
>> interceptions or
>> interference.
>> ----------------------------------------------------------------------------
>>
>> _______________________________________________
>> Snowball-discuss mailing list
>> Snowball-discuss at lists.tartarus.org
>> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>
>





