[Snowball-discuss] transition matrix or diagram for suffixes

Wed Jul 8 13:47:56 BST 2009

I am looking for a way to represent the "transitions" between words having
the same stem but different suffixes.  That would then be used to
investigate how to identify the likely part of speech of a word, and then
that could be used to help with stemming. 

One way to show this information is with a diagram having boxes and arrows.
I have seen that, but now I cannot find it. 

For example there would be several arrows from the stem to boxes labeled -s,
-ed, -ing, etc, showing that those suffixes can be added to the stem.  There
would be an arrow from -ing to -s showing that the compound suffix -ings can
be generated, but there would not be an arrow from -ed to -s.
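
As a minimal sketch of the box-and-arrow idea, the arrows can be held in a dictionary and written out as Graphviz DOT text. The dictionary layout and the names `ARROWS`, `to_dot` and `allowed` are assumptions made for illustration, not an existing tool, and only the arrows named in the paragraph above are included.

```python
# Hypothetical encoding of the arrows described above; "0" stands for the bare stem.
ARROWS = {
    "0": ["-s", "-ed", "-ing"],
    "-ing": ["-s"],  # -ing followed by -s gives the compound suffix -ings
    "-ed": [],       # deliberately no arrow from -ed to -s
}


def to_dot(arrows):
    """Render the arrows as Graphviz DOT text, one box per label."""
    lines = ["digraph suffixes {", "    node [shape=box];"]
    for source, targets in arrows.items():
        for target in targets:
            lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


def allowed(arrows, source, target):
    """Return True if there is an arrow from source to target."""
    return target in arrows.get(source, [])


print(to_dot(ARROWS))
```

The printed text can be pasted into any DOT renderer to get the diagram.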

Another way is to use a transition matrix.  The matrix I am looking for need
not have this exact format; this is an illustration.   It shows the source
along the vertical axis, and the target on the horizontal axis.  Each
allowed transition is marked.  Below is a tiny portion of such a matrix.
Notes: I use 0 (zero) to denote the stem.  I include some suffixes that are
not removed by the Porter stemmer, e.g. -er and -est.  I include compound
suffixes, e.g., -ings, because I think this is needed for the matrix to be
correct.    

        0 -s -ed -ing -ings -ly -er -est
0          *  *   *    *     *   *   *
-s
-ed                          *   *   *
-ing       *                 *
-ings
-ly
-er        *
-est
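
The same matrix can be sketched in code, which makes it easy to query a single cell. The marked cells are copied from the illustration above; the names `LABELS`, `ALLOWED`, `as_matrix` and `render` are hypothetical, and the set-of-targets layout is just one possible encoding.

```python
LABELS = ["0", "-s", "-ed", "-ing", "-ings", "-ly", "-er", "-est"]

# Marked cells from the matrix above: source -> set of allowed targets.
ALLOWED = {
    "0": {"-s", "-ed", "-ing", "-ings", "-ly", "-er", "-est"},
    "-ed": {"-ly", "-er", "-est"},
    "-ing": {"-s", "-ly"},
    "-er": {"-s"},
}


def as_matrix(labels, allowed):
    """Build a boolean matrix: rows are sources, columns are targets."""
    return [
        [target in allowed.get(source, set()) for target in labels]
        for source in labels
    ]


def render(labels, matrix):
    """Format the matrix as plain text, marking allowed transitions with '*'."""
    width = max(len(label) for label in labels) + 1
    rows = [" " * width + "".join(label.ljust(width) for label in labels)]
    for label, row in zip(labels, matrix):
        cells = "".join(("*" if cell else "").ljust(width) for cell in row)
        rows.append((label.ljust(width) + cells).rstrip())
    return "\n".join(rows)


matrix = as_matrix(LABELS, ALLOWED)
print(render(LABELS, matrix))
```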

Note that the -er suffix shows an interesting phenomenon: what it produces,
and what may follow it, depends on the part of speech of the input.  For
example, if the stem is an adjective then adding -er makes the comparative
form of that adjective, e.g., red -> redder.  But if the source is a verb
then adding -er makes a noun, e.g., run -> runner.  It is only the latter
form that allows the -s suffix to be added.  So I am also interested in any
way to represent the part of speech (POS) labels for the various forms, and
to include them in the transition matrix or diagram.  One idea is to attach
the POS label to the items on the axes.  Then there would be two rows for
-er on the source axis, labeled e.g. -er/J (for adjective) and -er/N (for
noun).
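
A rough sketch of the labeled-axis idea: each node becomes a (suffix, POS) pair, so -er/J and -er/N are distinct rows. The tag "V" for verb, the dictionary layout, and the names `TAGGED`, `label` and `can_follow` are assumptions for illustration; only the red/redder and run/runner cases from the text are encoded.

```python
# Hypothetical POS tags: "J" = adjective, "N" = noun, "V" = verb (assumed tag).
# Nodes are (suffix, pos) pairs; "0" is the bare stem.
TAGGED = {
    ("0", "J"): {("-er", "J")},   # red -> redder (comparative adjective)
    ("0", "V"): {("-er", "N")},   # run -> runner (noun)
    ("-er", "N"): {("-s", "N")},  # runner -> runners
    ("-er", "J"): set(),          # no -s after the comparative
}


def label(node):
    """Format a node the way it would appear on an axis, e.g. '-er/J'."""
    suffix, pos = node
    return f"{suffix}/{pos}"


def can_follow(tagged, source, target_suffix):
    """Return True if any POS reading of target_suffix may follow source."""
    return any(suffix == target_suffix for suffix, _ in tagged.get(source, set()))


for source, targets in TAGGED.items():
    shown = ", ".join(sorted(label(t) for t in targets)) or "(none)"
    print(f"{label(source)} -> {shown}")
```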

If we have a word list (really a set of words) we can infer the part of
speech of many words.  For example, if 0, 0-er, and 0-est all exist then it
is likely that 0 is an adjective.  So having red, redder, and reddest in a
word list makes it likely that red is an adjective, and that the other two
forms can be stemmed to red.
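
The adjective inference above can be sketched as a simple membership check over the word set. This is only an illustration under stated assumptions: the helper names `attach` and `likely_adjectives` are hypothetical, and only two spelling patterns are tried (plain concatenation and doubling of the final letter, as in red -> redder), so real word lists would need more spelling rules.

```python
def attach(stem, suffix):
    """Yield candidate spellings of stem + suffix (suffix without the hyphen).

    Assumption: only plain concatenation and final-letter doubling are tried.
    """
    yield stem + suffix
    yield stem + stem[-1] + suffix


def likely_adjectives(words):
    """Map each stem whose -er and -est forms both occur to those two forms."""
    found = {}
    for stem in words:
        if not stem:
            continue
        er = next((w for w in attach(stem, "er") if w in words), None)
        est = next((w for w in attach(stem, "est") if w in words), None)
        if er and est:
            found[stem] = (er, est)
    return found


words = {"red", "redder", "reddest", "run", "runner", "runs"}
print(likely_adjectives(words))
```

Here "run" is not reported because no -est form of it is in the set, which matches the intuition that run is not an adjective.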

I am trying to represent those probabilistic implications, using a diagram
or matrix.   Has anyone done this?  I would think so.  Or is another
formalism used?  Perhaps a set of "rules" and a non-procedural solver. 

Finally, consider the case where we have a bag of words, i.e. each word has
a count.    What kind of matrix, diagram or other kind of model should be
used for this analysis? 

I looked at Bayesian networks, influence diagrams, and Hidden Markov Models
(HMM) a little.  Probably I do not understand them well enough, but I do not
see how to use them for this purpose.

Thanks,

Steve Tolkin  
