[Snowball-discuss] Small changes to English stemmer
Tolkin, Steve
Steve.Tolkin at FMR.COM
Fri Jan 13 15:42:49 GMT 2006
1. I don't understand what problem the first change (for ied and ies) is
intended to solve.
I think nowadays the most likely usage of "ied" is "improvised explosive
device".
Stemming this to "ie" is no better than, and perhaps worse than,
producing "i".
Perhaps the best treatment is to leave it alone, as "ied", so it will
conflate with "ieds".
The most likely use of "ie" (after "i.e." written without the periods)
is for Internet Explorer.
But this will rarely be spelled "ies". The most likely usage of "ies" is
as an acronym. Google finds 16 million hits and the first 100 are all
acronyms. So again, perhaps just leave it alone.
2. The most frequent use of a leading Y as a vowel is in proper names,
e.g., Yvonne (13 M hits) and Yvette (5 M). But I do not think these are
affected by the second change, still producing:
yvonne -> yvonn
yvette -> yvett
Hopefully helpfully yours,
Steve
---
Steven Tolkin
There is nothing so practical as a good theory. Comments are by me, not
Fidelity Investments, its subsidiaries or affiliates.
-----Original Message-----
From: snowball-discuss-bounces at lists.tartarus.org
[mailto:snowball-discuss-bounces at lists.tartarus.org] On Behalf Of
martin.porter at grapeshot.co.uk
Sent: Monday, January 09, 2006 5:24 AM
To: Snowball Discuss
Subject: [Snowball-discuss] Small changes to English stemmer
There have been two small changes to the English (Porter2) stemming
algorithm.
The first is that the Rule
ied ies
replace by ie if preceded by just one letter, otherwise by i
has been changed to
ied ies
replace by i if preceded by more than one letter, otherwise by ie
There is a corresponding change in the Snowball script:
'ied' 'ies'
((next atlimit <-'ie') or <-'i')
has been changed to
'ied' 'ies'
((hop 2 <-'i') or <-'ie')
This ONLY affects the two 'words' ied and ies. Formerly they stemmed to
i, now they stem to ie.
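As a rough illustration of the rule change, here is a Python sketch that applies only this one step in isolation. It is not the Snowball source: the function names are made up for the example, and it assumes the input is a plain lowercase string.

```python
def stem_ied_ies_old(word: str) -> str:
    """Old rule: replace by 'ie' if preceded by just one letter, otherwise by 'i'."""
    if not word.endswith(("ied", "ies")):
        return word
    prefix = word[:-3]  # everything before the suffix
    return prefix + ("ie" if len(prefix) == 1 else "i")


def stem_ied_ies_new(word: str) -> str:
    """New rule: replace by 'i' if preceded by more than one letter, otherwise by 'ie'."""
    if not word.endswith(("ied", "ies")):
        return word
    prefix = word[:-3]  # everything before the suffix
    return prefix + ("i" if len(prefix) > 1 else "ie")


if __name__ == "__main__":
    # Print each word with its result under the old and the new rule.
    for w in ("ied", "ies", "tied", "ties", "cried", "cries"):
        print(w, stem_ied_ies_old(w), stem_ied_ies_new(w))
```

Under this sketch only the bare words ied and ies come out differently; tied and cried give tie and cri under both versions.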
The second is that the line,
do ( ['y'] v <-'Y' set Y_found)
which did not match the Rule
Set initial y ... to Y,
has been changed to
do ( ['y'] <-'Y' set Y_found)
which does.
(The problem was whether to make the rule match the coding or the coding
match the rule. The point is that in English initial y, when followed by
a consonant, is a vowel, but that only archaic words have this shape:
yclept and so on. I have decided to keep things simple and treat initial
y as a consonant in all cases.)
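A minimal Python sketch of the two codings of the initial-y step, for comparison. It assumes v is the vowel grouping a e i o u y, as in the English stemmer's definitions, and it handles only the initial y, not the remainder of the rule. The function names are hypothetical.

```python
VOWELS = set("aeiouy")  # assumed contents of the Snowball grouping v


def mark_initial_y_old(word: str) -> str:
    """Old coding: ['y'] v <-'Y' marks initial y only when a vowel follows."""
    if len(word) >= 2 and word[0] == "y" and word[1] in VOWELS:
        return "Y" + word[1:]
    return word


def mark_initial_y_new(word: str) -> str:
    """New coding: ['y'] <-'Y' marks every initial y as the consonant Y."""
    if word.startswith("y"):
        return "Y" + word[1:]
    return word


if __name__ == "__main__":
    # Compare the two codings on a few words beginning with y.
    for w in ("yes", "youth", "yclept", "yvonne"):
        print(w, mark_initial_y_old(w), mark_initial_y_new(w))
```

In this sketch the two versions agree on words such as yes and youth, and differ only where initial y is followed by a consonant, as in yclept or yvonne.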
Both these changes are trivial.
There is a rule to remove initial apostrophe in the stemmer, which I
have come to think is a bit feeble, but it can be left in for now.
Martin