[Snowball-discuss] Consider changing the suffix -version to -vert rather than to -verse
Tolkin, Steve
Steve.Tolkin at FMR.COM
Fri Dec 5 22:25:55 GMT 2008
Dear Martin,
Your paper was very interesting and enlightening. I have a few
questions and some feedback.
Some of the paper uses actual Snowball notation to describe a stemmer, a
Lovins-style Porter stemmer. But that stemmer is not listed at
http://snowball.tartarus.org/ Does that stemmer actually exist?
If so, please post it. If not, how much work remains and do you have
any plans to finish it?
You call our attention to the section on second conjugation form, and so
I focus on that.
The table shown in the paper seems incomplete.
For example, "pro" is not in the list of prefixes for the suffix "cess".
This omission is striking considering there is an explicit final_respell
rule to change "proced" to "proceed".
If you ever revise the paper, and/or create that Lovins-style Porter
stemmer, you might want to make the following changes. (All the above
is "off the top of my head". I have not written a program that might
find additional worthwhile prefixes.)
1. New prefixes for existing rows:
Suffix Prefix
cess pro
clus se
lus col
trus in
vas in per
vers sub
2. Add 2 entirely new rows:
'tent' (<-'tain') // de re
'tent' (<-'tend') // con dis ex in
Note that this last set of 2 rules shows that the same "suffix" or
"ending" should be treated differently depending on the prefix. (Is
there another example of that already? There is not one in this table.)
In fact the suffix "tent" has a third way of being treated, namely
preserved as "tent", when the prefix is "in". Adding the last line above
will conflate intention with intension. This would be bad for
philosophers and others who use these terms in distinct ways, but will
benefit most people. Extention is most often a spelling variant of
extension, and so these should be conflated.
Here is the table with my suggestions added:
define second_conjugation_form as (
[substring] prefix among (
'cept' (<-'ceiv') //-e con de re
'cess' (<-'ced') //-e con ex inter pre pro re se suc
'cis' (<-'cid') //-e de (20)
'clus' (<-'clud') //-e con ex in oc se (26)
'curs' (<-'cur') // re (6)
'dempt' (<-'deem') // re
'duct' (<-'duc') //-e de in re pro (3)
'fens' (<-'fend') // de of
'hes' (<-'her') //-e ad (28)
'lis' (<-'lid') //-e e col (21)
'lus' (<-'lud') //-e al col de e
'miss' (<-'mit') // ad com o per re sub trans (29)
'pans' (<-'pand') // ex (23)
'plos' (<-'plod') //-e ex
'prehens' (<-'prehend') // ap com
'ris' (<-'rid') //-e de (22)
'ros' (<-'rod') //-e cor e
'scens' (<-'scend') // a
'script' (<-'scrib') //-e de in pro
'solut' (<-'solv') //-e dis re (8)
'sorpt' (<-'sorb') // ab (5)
'spons' (<-'spond') // re (25)
'sumpt' (<-'sum') // con pre re (4)
'suas' (<-'suad') //-e dis per (18)
'tens' (<-'tend') // ex in pre (24)
'tent' (<-'tain') // de re
'tent' (<-'tend') // con dis ex in
'trus' (<-'trud') //-e in ob (27)
'vas' (<-'vad') //-e e in per (19)
'vers' (<-'vert') // a con e in re sub (31)
'vis' (<-'vid') //-e di pro
)
)
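As a rough illustration of how the table is meant to behave, here is a
minimal Python sketch (not Snowball, and not the stemmer from the paper).
It assumes the input is a stem from which -ion or -ive has already been
removed, and it assumes that the prefixes listed in each comment must
match the whole of what precedes the ending. The names RULES and
second_conjugation_form are hypothetical, and only a subset of the rows
is included.

```python
# Illustrative subset of the table above: (ending, replacement, prefixes).
# Enforcing the prefixes listed in the comments is an assumption of this
# sketch, not a claim about how the Snowball source handles them.
RULES = [
    ("cept", "ceiv", {"con", "de", "re"}),
    ("cess", "ced", {"con", "ex", "inter", "pre", "pro", "re", "se", "suc"}),
    ("clus", "clud", {"con", "ex", "in", "oc", "se"}),
    ("lus", "lud", {"al", "col", "de", "e"}),
    ("tens", "tend", {"ex", "in", "pre"}),
    ("tent", "tain", {"de", "re"}),
    ("tent", "tend", {"con", "dis", "ex", "in"}),
    ("vers", "vert", {"a", "con", "e", "in", "re", "sub"}),
]


def second_conjugation_form(stem: str) -> str:
    """Respell a stem (already stripped of -ion or -ive) if a rule applies."""
    for ending, replacement, prefixes in RULES:
        # The part before the ending must be exactly one of the prefixes.
        if stem.endswith(ending) and stem[: -len(ending)] in prefixes:
            return stem[: -len(ending)] + replacement
    return stem


for stem in ["convers", "detent", "extent", "intens", "intent", "univers"]:
    print(stem, "->", second_conjugation_form(stem))
```

Under these assumptions the two 'tent' rows are told apart purely by
prefix, "intens" and "intent" both become "intend", and a stem such as
"univers" is left alone because "uni" is not a listed prefix.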
P.S. If you want I could also send the output of diff on the original
and proposed.
P.P.S. Some brand new prefixes should be added to the rule for the
suffix vers, e.g., extra, extro, and intro, and maybe also recon.
Hopefully helpfully yours,
Steve
--
Steven Tolkin
There is nothing so practical as a good theory. Comments are by me,
not Fidelity Investments, its subsidiaries or affiliates.
-----Original Message-----
From: Martin Porter [mailto:martin at porterloo.wanadoo.co.uk]
Sent: Monday, November 24, 2008 10:16 AM
To: Tolkin, Steve; snowball-discuss at lists.tartarus.org
Subject: Re: [Snowball-discuss] Consider changing the suffix -version to
-vert rather than to -verse
Steve (and everybody on the mailing list...),
Thank you for the carefully considered email.
I have put up some extra material at,
http://snowball.tartarus.org/algorithms/lovins/festschrift.html
Search for the string second_conjugation_form and read the surrounding
text.
This adds quite a bit of context to your observation. You will see that
the whole seemingly complex area of handling stem variations caused by
change in the participle forms of second conjugation Latin verbs can be
handled fairly easily in snowball. (I know that sounds very technical,
but you'll see the intention as soon as you read the extract from the
paper.) It deals with the case you mention, since the rule would be to
change vers to vert after -ive or -ion removal.
You may very well ask, if I did this work seven years ago, why is it not
part of the snowball Porter2 algorithm?
The answer really is I don't think now that further elaboration of the
algorithms is the way to go. There are after all many worse problems
than the one you mention: 'university' and 'universe' going to the same
stem for example. And no matter how large the exception lists, the
algorithms can never be "perfect", or even "optimal". This is because
the degree of conflation useful in an IR task (typically, during
retrieval through an index) varies with the task. In particular,
different types of query benefit from different degrees of stemming.
Something else I've become aware of is that collection size affects the
usefulness of stemming. If you have 1 million articles (news stories
maybe), it doesn't matter that 'determine', 'deterministic',
'determinant' conflate. If you have a billion articles it matters a
great deal.
I now tend to regard the stemmers as tools on the way to forming
conflation classes of words that are useful in a particular retrieval
context. Perhaps the snowball website needs to reflect that more.
Meanwhile it is important that little gems like second_conjugation_form
should be made more visible on the site, and I'll see what I can do
about that,
Martin
P.S. I'm not ruling out us adding in your suggested change. That could
happen. I just want to set it in the context outlined above.
>
>Summary:
>The Porter stemmer should consider changing the suffix -version to
>-vert. Currently it changes it to -verse. This would prevent several
>bad conflations, e.g., conversion with conversation, and produce good
>conflations, e.g. conversion with convert.
>
>Details:
>Here is what the Porter stemmer does (via the Snowball demo at
>http://snowball.tartarus.org/demo.php )
>converse -> convers
>convert -> convert
>conversion -> convers
>conversation -> convers
>
>I think there are several problems with this, e.g.
>* conversation should not match conversion (a bad conflation, aka false
>positive, or false match)
>* conversion should match convert (a false negative or false miss).
>
>The easiest way to fix both of these problems (and others) would be to
>change the stemmer to produce:
> conversion -> convert
>rather than the current:
> conversion -> convers
>
>Generalizing this, I think that all words that end with -version (after
>removing other suffixes such as -s, -ing, -al, -ed, etc.) should be
>stemmed to end with -vert. In my analysis this never causes a bad
>conflation. (This would miss a few good conflations, but overall it
>would be much better than the current approach.)
>
>Here is the complete list of 22 words that end -version in yawl.lst,
>and which have at least one preceding letter (to avoid version itself).
>There would be more if we include other suffixes.
>
>ambiversion animadversion anteversion antisubversion aversion
>bioconversion conversion diversion eversion extraversion extroversion
>interconversion introversion inversion obversion perversion
>reconversion retroversion reversion seroconversion subversion
>transversion
>
>19 of these 22 also exist in yawl.lst when I replace the final -version
>with a -vert. The exceptions are antisubvert, bioconvert, and
>transvert. Note that these simply fail to find a match; there are no
>false positives and no false negatives.
>
>In contrast, changing the last -version to -verse (in essence what is
>done currently; the stemmer actually changes the final -se to just -s)
>produces a much smaller set:
>
>averse converse diverse inverse obverse perverse reverse subverse
>transverse
>
>The current approach has both false negatives and false positives. For
>example, I think the first word is much closer semantically to the
>third than to the second, in the table below.
>
>Input       Current   Proposed
>conversion  converse  convert
>diversion   diverse   divert
>reversion   reverse   revert
>subversion  subverse  subvert
>
>Similarly, but less forcefully, for many of the others.
>
>In conclusion:
>It is possible that Martin Porter and others will decide that it is not
>worth making the change, in part because few words are affected.
>However, some of these words (and the variants with other suffixes) are
>quite common. I also suggest this thought experiment. Suppose the
>proposed change had been the behavior of the algorithm from the
>beginning. If someone now suggested stemming -version to -verse to be
>consistent with the other -sion endings, wouldn't people complain that
>the algorithm should not be made worse?
>
>
>Hopefully helpfully yours,
>Steve
>
>--
>Steven Tolkin
>
>All opinions expressed are my own, and not that of my employer.
>
>
>_______________________________________________
>Snowball-discuss mailing list
>Snowball-discuss at lists.tartarus.org
>http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>
>
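The counts in the quoted summary can be re-derived with a short Python
sketch. It does not read yawl.lst: the lexicon sets below are stand-ins
built only from the words named in the message (the nine -verse words,
and the -vert forms minus the three reported as missing), so it checks
the arithmetic, not the word list itself. The helper names swap_ending
and count_matches are hypothetical.

```python
# The 22 -version words listed in the quoted message.
VERSION_WORDS = """
ambiversion animadversion anteversion antisubversion aversion
bioconversion conversion diversion eversion extraversion extroversion
interconversion introversion inversion obversion perversion reconversion
retroversion reversion seroconversion subversion transversion
""".split()

# The -verse words the message reports as present in the word list.
VERSE_WORDS = set(
    "averse converse diverse inverse obverse perverse reverse "
    "subverse transverse".split()
)

# The -vert forms the message reports as absent from the word list.
MISSING_VERT = {"antisubvert", "bioconvert", "transvert"}


def swap_ending(word: str, old: str, new: str) -> str:
    """Replace a trailing `old` with `new`; `word` must end with `old`."""
    if not word.endswith(old):
        raise ValueError(f"{word!r} does not end with {old!r}")
    return word[: -len(old)] + new


# Stand-in lexicon (assumption): every -vert form except the missing three.
VERT_WORDS = {swap_ending(w, "version", "vert") for w in VERSION_WORDS}
VERT_WORDS -= MISSING_VERT


def count_matches(new_ending: str, lexicon: set) -> int:
    """Count -version words whose respelled form appears in `lexicon`."""
    return sum(
        swap_ending(w, "version", new_ending) in lexicon
        for w in VERSION_WORDS
    )


print(len(VERSION_WORDS), count_matches("vert", VERT_WORDS),
      count_matches("verse", VERSE_WORDS))
```

With these stand-in sets, 19 of the 22 words find a -vert counterpart and
9 find a -verse counterpart, which matches the figures in the message.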