[Snowball-discuss] Consider changing the suffix -version to -vert rather than to -verse

Tolkin, Steve Steve.Tolkin at FMR.COM
Wed Nov 19 00:42:38 GMT 2008


Summary:
The Porter stemmer should consider changing the suffix -version to
-vert.  Currently it changes it to -verse.  This would prevent several
bad conflations, e.g., conversion with conversation, and produce good
conflations, e.g. conversion with convert. 

Details:
Here is what the Porter stemmer does (via the Snowball demo at
http://snowball.tartarus.org/demo.php )
converse -> convers
convert -> convert
conversion -> convers
conversation -> convers

I think there are several problems with this, e.g.
* conversation should not match conversion (a bad conflation, aka false
positive, or false match)
* conversion should match convert (a false negative or false miss).  

The easiest way to fix both of these problems (and others) would be to
change the stemmer to produce: 
   conversion -> convert
rather than the current:
   conversion -> convers

Generalizing this, I think that all words that end with -version (after
removing other suffixes such as -s, -ing, -al, -ed, etc.) should be
stemmed to end with -vert.  In my analysis this never causes a bad
conflation.  (This would miss a few good conflations, but overall it
would be much better than the current approach.)

Here is the complete list of 22 words that end -version in yawl.lst, and
which have at least one preceding letter (to avoid version itself).
There would be more if we include other suffixes.

ambiversion animadversion anteversion antisubversion aversion
bioconversion conversion diversion eversion extraversion extroversion
interconversion introversion inversion obversion perversion reconversion
retroversion reversion seroconversion subversion transversion

19 of these 22 also exist in yawl.lst when I replace the final -version
with a -vert.  The exceptions are antisubvert, bioconvert, and
transvert.  Note that these simply fail to find a match; there are no
false positives and no false negatives.
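The check above could be reproduced roughly as follows.  This is only a
sketch: the function name split_by_lexicon is hypothetical, and the
lexicon is passed in as a parameter because the location and exact
format of yawl.lst (assumed here to be one word per line) are not given
in this message.

```python
# The 22 -version words listed above.
VERSION_WORDS = """
ambiversion animadversion anteversion antisubversion aversion
bioconversion conversion diversion eversion extraversion extroversion
interconversion introversion inversion obversion perversion reconversion
retroversion reversion seroconversion subversion transversion
""".split()


def split_by_lexicon(words, lexicon, old="version", new="vert"):
    """Swap the suffix `old` for `new` and report which results are known.

    Returns (found, missing): the rewritten forms that are in `lexicon`
    and those that are not.  Words not ending in `old` are skipped.
    """
    known = set(lexicon)
    found, missing = [], []
    for word in words:
        if not word.endswith(old):
            continue
        candidate = word[: -len(old)] + new
        (found if candidate in known else missing).append(candidate)
    return found, missing


# Assumed usage, with a word list stored one word per line:
#   with open("yawl.lst") as f:
#       lexicon = {line.strip() for line in f}
#   found, missing = split_by_lexicon(VERSION_WORDS, lexicon)
```

Calling the same function with new="verse" would give the comparison
set for the current behavior.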
 
In contrast, changing the last -version to -verse (in essence what is
done currently; the stemmer actually changes the final -se to just -s)
produces a much smaller set:

averse converse diverse inverse obverse perverse reverse subverse
transverse

The current approach has both false negatives and false positives.  For
example, I think the first word is much closer semantically to the third
than to the second, in the table below.

Input       Current    Proposed
conversion  converse   convert
diversion   diverse    divert
reversion   reverse    revert
subversion  subverse   subvert

Similarly, but less forcefully, for many of the others.   
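
To make the effect on conflation concrete, here is a small sketch that
groups the four words from the Details section by stem.  The stems are
hard-coded from this message (the demo output for the current behavior,
and the proposed conversion -> convert); no stemmer is actually run,
and the names CURRENT, PROPOSED and conflation_classes are illustrative
only.

```python
from collections import defaultdict

# Stems as reported by the Snowball demo, quoted earlier in this message.
CURRENT = {
    "converse": "convers",
    "convert": "convert",
    "conversion": "convers",
    "conversation": "convers",
}

# The proposal changes only the stem of "conversion".
PROPOSED = dict(CURRENT, conversion="convert")


def conflation_classes(stems):
    """Group words by stem; each group is one conflation class."""
    classes = defaultdict(set)
    for word, stem in stems.items():
        classes[stem].add(word)
    return {stem: sorted(words) for stem, words in classes.items()}


if __name__ == "__main__":
    print("current: ", conflation_classes(CURRENT))
    print("proposed:", conflation_classes(PROPOSED))
```

Under the current stems, conversion shares a class with conversation
and converse; under the proposed stems it shares a class with convert
instead.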

In conclusion:
It is possible that Martin Porter and others will decide that it is not
worth making the change, in part because few words are affected.
However, some of these words (and the variants with other suffixes) are
quite common. I also suggest this thought experiment.  Suppose the
proposed change had been the behavior of the algorithm from the
beginning.  If someone now suggested stemming -version to -verse, to be
consistent with the other -sion endings, wouldn't people complain that
the algorithm should not be made worse?


Hopefully helpfully yours,
Steve

--
Steven Tolkin 

All opinions expressed are my own, and not that of my employer.



