[Snowball-discuss] Comparing stemmers

Olly Betts olly at survex.com
Tue Oct 17 05:33:26 BST 2023


I've been wanting a way to compare the output from two versions of a
stemmer to aid evaluating the changes to algorithms which get proposed
from time to time.

I tried to write a script to analyse and describe the changes, but
was struggling to find a good way to handle more complex cases.  Then I
hit on the idea of throwing the data for the difficult cases at
graphviz's dot tool and asking it to draw a graph and that turns out to
work nicely.

Here's sample output comparing the original Porter stemmer with the
Snowball "English" stemmer (the latter is derived from the former, but
includes a number of improvements):

https://survex.com/~olly/stemmer-compare-porter-to-english/

These two stemmers differ much more than the cases I was aiming to
handle, but it is still possible to read through and review the changes.

This analysis assumes we don't really care what the actual stems are,
and works purely from the sets of words which share a stem under one
stemmer vs the sets under the other stemmer.  This is a bit of an
extreme position, but it's already easy to look at the stems (stemwords
-p2 for example).
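
To illustrate the idea of comparing sets rather than stems, here is a
minimal Python sketch.  It is not taken from scripts/stemmer-compare; the
function name stem_sets and the sample stems are made up for illustration,
and it assumes each stemmer's output is available as a word -> stem
mapping.

```python
from collections import defaultdict


def stem_sets(word_to_stem):
    """Group words by stem and return only the groups, discarding the stems."""
    groups = defaultdict(set)
    for word, stem in word_to_stem.items():
        groups[stem].add(word)
    return {frozenset(words) for words in groups.values()}


# Hypothetical stems, invented for this example (not real stemmer output).
before = {"running": "run", "runs": "run", "runner": "runner"}
after = {"running": "runn", "runs": "runn", "runner": "runner"}

# The stem for two words changed, but the words are grouped identically,
# so under this view the two stemmers are equivalent on this vocabulary.
print(stem_sets(before) == stem_sets(after))  # prints True
```

Using frozensets inside a set means the comparison ignores both the stem
strings and the order in which words were seen.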

The script to generate this is scripts/stemmer-compare in the
snowball-data git repo.

It's very new but I thought it worth announcing at this point as it
should already be useful to people developing new stemming algorithms.

To briefly explain the output:

* Words which "aren't interesting" are those whose stem changed but
  which are still in sets with exactly the same other words before and
  after.

* Merges are simple cases where two or more sets before become a single
  set after.

* Splits are simple cases where a single set before becomes two or more
  sets after.

* Then the remaining more complex cases are shown as a series of graphs.
  The arrows show which sets words move between (to see which words, you
  can look at the sets).
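
The four categories above could be computed along these lines.  This is
only a sketch under my own assumptions, not the logic of the actual
script: it assumes both stemmers were run over the same vocabulary, the
names group_by_stem and classify are hypothetical, and the sample stems
are invented.  Words are gathered into connected groups (linked if they
share a set before or after), and each group is classified by how many
before and after sets it contains.

```python
from collections import defaultdict


def group_by_stem(word_to_stem):
    """Return a mapping of stem -> set of words with that stem."""
    groups = defaultdict(set)
    for word, stem in word_to_stem.items():
        groups[stem].add(word)
    return groups


def classify(before, after):
    """Classify changes; before/after map word -> stem over the same words."""
    b_groups = group_by_stem(before)
    a_groups = group_by_stem(after)
    result = {"uninteresting": [], "merges": [], "splits": [], "complex": []}
    seen = set()
    for start in before:
        if start in seen:
            continue
        # Collect every word reachable via shared before-sets or after-sets.
        component = set()
        pending = [start]
        while pending:
            word = pending.pop()
            if word in component:
                continue
            component.add(word)
            pending.extend(b_groups[before[word]] - component)
            pending.extend(a_groups[after[word]] - component)
        seen |= component
        b_sets = {frozenset(b_groups[before[w]]) for w in component}
        a_sets = {frozenset(a_groups[after[w]]) for w in component}
        if len(b_sets) == 1 and len(a_sets) == 1:
            # Same grouping; only report it if the stem itself changed.
            if any(before[w] != after[w] for w in component):
                result["uninteresting"].append(sorted(component))
        elif len(a_sets) == 1:
            result["merges"].append(sorted(component))
        elif len(b_sets) == 1:
            result["splits"].append(sorted(component))
        else:
            result["complex"].append(sorted(component))
    return result


# Hypothetical stems, invented for this example (not real stemmer output).
before = {
    "cat": "cat", "cats": "cat",
    "new": "new", "news": "news",
    "general": "gener", "generate": "gener",
    "universe": "univers", "university": "univers", "universal": "universal",
}
after = {
    "cat": "ca", "cats": "ca",
    "new": "new", "news": "new",
    "general": "general", "generate": "generat",
    "universe": "univers", "university": "universiti", "universal": "universiti",
}

for category, groups in classify(before, after).items():
    print(category, groups)
```

Words whose set and stem are both unchanged fall into none of the four
lists, since there is nothing to report for them.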

I could just throw everything apart from the first category at graphviz
and get graphs for the merges and splits too.  I'd already written code
to handle these which gives more compact output, but it'd be easy to
have an option to show graphs for merges and splits too if people would
like one.
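
For anyone curious what handing such data to dot might look like, here is
a rough Python sketch which writes a graph in the DOT language.  The
layout choices (one box per set, an arrow from each before-set to every
after-set which receives at least one of its words) are my assumptions
and may not match what the script actually emits; to_dot and dot_quote
are hypothetical names and the stems are invented.

```python
from collections import defaultdict


def dot_quote(text):
    """Quote a string for use as a DOT identifier or label."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_dot(before, after):
    """Return DOT source; before/after map word -> stem over the same words."""
    b_groups = defaultdict(set)
    a_groups = defaultdict(set)
    for word in before:
        b_groups[before[word]].add(word)
        a_groups[after[word]].add(word)
    lines = ["digraph changes {", "  rankdir=LR;", "  node [shape=box];"]
    # One node per set, labelled with the words in that set.
    for prefix, groups in (("before:", b_groups), ("after:", a_groups)):
        for stem, words in sorted(groups.items()):
            label = " ".join(sorted(words))
            lines.append(f"  {dot_quote(prefix + stem)} [label={dot_quote(label)}];")
    # One arrow per (before-set, after-set) pair which shares a word.
    edges = sorted({(before[w], after[w]) for w in before})
    for b_stem, a_stem in edges:
        lines.append(
            f"  {dot_quote('before:' + b_stem)} -> {dot_quote('after:' + a_stem)};"
        )
    lines.append("}")
    return "\n".join(lines)


# Hypothetical stems, invented for this example (not real stemmer output).
before = {"universe": "univers", "university": "univers", "universal": "universal"}
after = {"universe": "univers", "university": "universiti", "universal": "universiti"}

print(to_dot(before, after))
```

Saving the printed text as changes.dot and running
"dot -Tsvg changes.dot -o changes.svg" should render it.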

Cheers,
    Olly


