[Snowball-discuss] Consider changing the suffix -version to -vert rather than to -verse

Martin Porter martin at porterloo.wanadoo.co.uk
Mon Nov 24 15:15:36 GMT 2008


Steve (and everybody on the mailing list...),

Thank you for the carefully considered email.

I have put up some extra material at:

http://snowball.tartarus.org/algorithms/lovins/festschrift.html

Search for the string second_conjugation_form and read the surrounding text.
This adds quite a bit of context to your observation. You will see that the
whole seemingly complex area of handling stem variations caused by change in
the participle forms of second conjugation Latin verbs can be handled fairly
easily in snowball. (I know that sounds very technical, but you'll see the
intention as soon as you read the extract from the paper.) It deals with the
case you mention, since the rule would be to change vert to vers after -ive
or -ion removal.

You may very well ask, if I did this work seven years ago, why is it not part
of the snowball Porter2 algorithm?

The answer really is that I no longer think further elaboration of the
algorithms is the way to go. There are after all many worse problems than
the one you mention: 'university' and 'universe' going to the same stem for
example. And no matter how large the exception lists, the algorithms can
never be "perfect", or even "optimal". This is because the degree of
conflation useful in an IR task (typically, during retrieval through an
index) varies with the task. In particular, different types of query benefit
from different degrees of stemming.

Something else I've become aware of is that collection size affects the
usefulness of stemming. If you have 1 million articles (news stories maybe),
it doesn't matter that 'determine', 'deterministic', 'determinant' conflate.
If you have a billion articles it matters a great deal. 

I now tend to regard the stemmers as tools on the way to forming conflation
classes of words that are useful in a particular retrieval context. Perhaps
the snowball website needs to reflect that more.
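
As a rough illustration of what I mean by conflation classes, here is a
minimal Python sketch. It is not Snowball and not part of any stemmer; the
function name conflation_classes and the hard-coded stem tables are made up
for the example. The "current" stems are the four reported in the demo output
quoted below, and the "proposed" table differs only in sending conversion to
convert, as suggested.

```python
from collections import defaultdict

# Stems as reported in the demo output quoted later in this message.
CURRENT_STEMS = {
    "converse": "convers",
    "convert": "convert",
    "conversion": "convers",
    "conversation": "convers",
}

# Hypothetical table: identical, except conversion is sent to convert.
PROPOSED_STEMS = {**CURRENT_STEMS, "conversion": "convert"}


def conflation_classes(words, stem):
    """Group words by the stem that the callable `stem` assigns to them."""
    classes = defaultdict(list)
    for word in words:
        classes[stem(word)].append(word)
    # Sort each class so the result does not depend on input order.
    return {s: sorted(ws) for s, ws in classes.items()}


words = list(CURRENT_STEMS)
current = conflation_classes(words, CURRENT_STEMS.get)
proposed = conflation_classes(words, PROPOSED_STEMS.get)

print(current)
print(proposed)
```

With the current stems, conversation, converse and conversion share one class
and convert sits alone; with the proposed table, conversion moves into the
class of convert. Which grouping is more useful depends on the retrieval
context.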

Meanwhile it is important that little gems like second_conjugation_form
should be made more visible on the site, and I'll see what I can do about that.

Martin
  
P.S. I'm not ruling out us adding in your suggested change. That could
happen. I just want to set it in the context outlined above.
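
For reference, the suggested change on its own amounts to something like the
following Python sketch. This is only an illustration under my own
assumptions: it is not Snowball, it is not how Porter2 is written, and the
function name respell_version is invented for the example. It also ignores
the prior removal of other suffixes (-s, -ing, -al, -ed and so on) that the
suggestion mentions.

```python
def respell_version(word):
    """Map a final -version to -vert; return any other word unchanged."""
    suffix = "version"
    # Require at least one letter before the suffix, so that the word
    # 'version' itself is left alone.
    if word.endswith(suffix) and len(word) > len(suffix):
        return word[: -len(suffix)] + "vert"
    return word


for w in ["conversion", "diversion", "version", "conversation"]:
    print(w, "->", respell_version(w))
```

Here conversion and diversion become convert and divert, while version and
conversation pass through untouched.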


 
>
>Summary:
>The Porter stemmer should consider changing the suffix -version to
>-vert.  Currently it changes it to -verse.  This would prevent several
>bad conflations, e.g., conversion with conversation, and produce good
>conflations, e.g. conversion with convert. 
>
>Details:
>Here is what the Porter stemmer does (via the Snowball demo at
>http://snowball.tartarus.org/demo.php )
>converse -> convers
>convert -> convert
>conversion -> convers
>conversation -> convers
>
>I think there are several problems with this, e.g.
>* conversation should not match conversion (a bad conflation, aka false
>positive, or false match)
>* conversion should match convert (a false negative or false miss).  
>
>The easiest way to fix both of these problems (and others) would be to
>change the stemmer to produce: 
>   conversion -> convert
>rather than the current:
>   conversion -> convers
>
>Generalizing this, I think that all words that end with -version (after
>removing other suffixes such as -s, -ing, -al, -ed, etc.) should be
>stemmed to end with -vert.  In my analysis this never causes a bad
>conflation.  (This would miss a few good conflations, but overall it
>would be much better than the current approach.)
>
>Here is the complete list of 22 words that end -version in yawl.lst, and
>which have at least one preceding letter (to avoid version itself).
>There would be more if we include other suffixes.
>
>ambiversion animadversion anteversion antisubversion aversion
>bioconversion conversion diversion eversion extraversion extroversion
>interconversion introversion inversion obversion perversion reconversion
>retroversion reversion seroconversion subversion transversion
>
>19 of these 22 also exist in yawl.lst when I replace the final -version
>with a -vert.  The exceptions are antisubvert, bioconvert, and
>transvert.  Note that these simply fail to find a match; there are no
>false positives and no false negatives.
> 
>In contrast, changing the last -version to -verse (in essence what is
>done currently; the stemmer actually changes the final -se to just -s)
>produces a much smaller set:
>
>averse converse diverse inverse obverse perverse reverse subverse
>transverse
>
>The current approach has both false negatives and false positives.  For
>example, I think the first word is much closer semantically to the third
>than to the second, in the table below.
>
>Input       Current    Proposed
>conversion  converse   convert
>diversion   diverse    divert
>reversion   reverse    revert
>subversion  subverse   subvert
>
>Similarly, but less forcefully, for many of the others.   
>
>In conclusion:
>It is possible that Martin Porter and others will decide that it is not
>worth making the change, in part because few words are affected.
>However, some of these words (and the variants with other suffixes) are
>quite common. I also suggest this thought experiment.  Suppose the
>proposed change had been the behavior of the algorithm from the
>beginning.  If someone now suggested stemming -version to -verse to be
>consistent with the other -sion endings, wouldn't people complain that
>the algorithm should not be made worse?
>
>
>Hopefully helpfully yours,
>Steve
>
>--
>Steven Tolkin 
>
>All opinions expressed are my own, and not those of my employer.
>
>
>_______________________________________________
>Snowball-discuss mailing list
>Snowball-discuss at lists.tartarus.org
>http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>
>




