[Snowball-discuss] Can snowball be run backwards to generate words?

Martin Porter martin_porter@softhome.net
Sat, 22 Dec 2001 14:56:28 -0700


You can turn the Porter stemmer inside out, and generate all endings that
the stemmer will recognise, but there are several problems. One is that the
endings go in circles, e.g.

   ize + ation as in realization
   ation + al as in operational
   al + ize as in normalize

- suggesting infinite endings izationalizational... You can break the loop
by noting that four is the upper limit on the number of derivational
suffixes that can be attached to a word in English.
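
As a rough illustration of breaking the loop, here is a minimal Python
sketch. It assumes only the three-suffix cycle shown above; the names
FOLLOWS, join and chains are hypothetical, and the rule of dropping a final
e before a vowel is a simplification, not the stemmer's own logic.

```python
# Which suffix may follow which, taken only from the three examples above.
FOLLOWS = {"ize": ["ation"], "ation": ["al"], "al": ["ize"]}

# Assumed cap on the number of derivational suffixes per word.
MAX_SUFFIXES = 4


def join(left, right):
    """Concatenate two suffixes, dropping a final 'e' before a vowel
    (ize + ation -> ization). A simplification for this sketch."""
    if left.endswith("e") and right[0] in "aeiou":
        left = left[:-1]
    return left + right


def chains(max_depth=MAX_SUFFIXES):
    """Return every compound ending built from at most max_depth suffixes."""
    results = []

    def extend(ending, last, depth):
        results.append(ending)
        if depth == max_depth:
            return  # the depth cap is what stops the cycle
        for nxt in FOLLOWS.get(last, []):
            extend(join(ending, nxt), nxt, depth + 1)

    for suffix in FOLLOWS:
        extend(suffix, suffix, 1)
    return results


if __name__ == "__main__":
    for ending in chains():
        print(ending)
```

Without the depth check the recursion would never terminate, because each
suffix in the cycle always has a successor.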

If you do this, you end up with really quite a lot of endings. Here is a
list I put together recently:

Inflexional: ed  ing  ings  s

Derivational:
            ic          ioned       *ationed     *icationed
    *izationed   *alizationed           ered        *izered
     *alizered    *icalizered   *ionalizered           ated
        icated           ized         alized      *icalized
    *ionalized   *ationalized           ance           ence
          able           ible            ate          icate
           ive          ative        icative            ize
         alize       *icalize      *ionalize    *ationalize
        ioning      *ationing    *icationing    *izationing
 *alizationing          ering       *izering     *alizering
  *icalizering  *ionalizering          ating        icating
         izing       *alizing     *icalizing    *ionalizing
 *ationalizing             al           ical          ional
       ational     *icational     *izational            ful
           ism          alism       *icalism      *ionalism
   *ationalism            ion          ation        ication
       ization      alization             er           izer
       *alizer      *icalizer     *ionalizer           ator
           ics          ances          ences         ancies
        encies          ities        icities        alities
    *icalities     ionalities  *ationalities      abilities
     ibilities       *ivities     *ativities   *icativities
         ables          ibles         nesses     *ivenesses
  *ativenesses *icativenesses      *alnesses    *icalnesses
  *ionalnesses *ationalnesses     *fulnesses     *ousnesses
          ates         icates           ives         atives
     *icatives           izes        *alizes      *icalizes
    *ionalizes   *ationalizes            als          icals
        ionals      *ationals    *icationals    *izationals
          isms        *alisms      *icalisms     *ionalisms
  *ationalisms           ions         ations       ications
      izations    *alizations            ers          izers
      *alizers     *icalizers    *ionalizers          ators
          ness        iveness     *ativeness   *icativeness
        alness      *icalness      ionalness   *ationalness
       fulness        ousness           ants           ents
         ments         ements            ous            ant
           ent           ment          ement           ancy
          ency             ly           ably           ibly
         ately       *icately          ively        atively
    *icatively           ally         ically        ionally
     ationally          ously          ently        *mently
      *emently            ity          icity          ality
       icality       ionality    *ationality        ability
       ibility          ivity       *ativity     *icativity

- sorted by ending and arranged in 4 columns. The endings marked * are very
rare or non-existent and could be ignored. There are some extra rules: for
example, endings beginning with ion should follow s or t in the stem. This
is a minimum list: you can argue for other forms (ableness for example).
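
The two conventions just described (the * marker and the s-or-t rule for
endings beginning with ion) could be applied along these lines. This is only
a sketch: parse_endings and ending_allowed are hypothetical helper names,
and the list is assumed to be pasted in as whitespace-separated text.

```python
def parse_endings(text):
    """Split a whitespace-separated list of endings into (common, rare).
    An ending written with a leading * counts as rare."""
    common, rare = [], []
    for token in text.split():
        if token.startswith("*"):
            rare.append(token.lstrip("*"))
        else:
            common.append(token)
    return common, rare


def ending_allowed(stem, ending):
    """Apply the one extra rule stated above: an ending that begins
    with 'ion' should follow 's' or 't' in the stem."""
    if ending.startswith("ion"):
        return stem.endswith(("s", "t"))
    return True


if __name__ == "__main__":
    common, rare = parse_endings("ic ioned *ationed *icationed")
    print(common, rare)
    print(ending_allowed("connect", "ion"))   # stem ends in t
    print(ending_allowed("oper", "ional"))    # stem ends in r
```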

If a word has the form s+e, where s is the stem and e the ending, then
looking up every form s+x, where x is any of these endings, could therefore
be quite expensive.
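
To make the cost concrete, here is a small Python sketch of that lookup. The
function name surface_forms and the toy vocabulary are assumptions for
illustration; a real system would probe its own index rather than a set.

```python
def surface_forms(stem, endings, vocabulary):
    """Return every stem+ending that occurs in the vocabulary.
    One membership probe is made per ending, so the work grows
    with the length of the ending list."""
    return [stem + ending for ending in endings if stem + ending in vocabulary]


if __name__ == "__main__":
    # A handful of endings taken from the list above.
    endings = ["ed", "ing", "ion", "ions", "ive", "ively", "ize"]
    # Toy vocabulary, invented for this example.
    vocabulary = {"connect", "connected", "connecting",
                  "connection", "connections", "connective"}
    print(surface_forms("connect", endings, vocabulary))
```

With the full list the loop makes one probe for each ending, for every stem
looked up, which is where the expense comes from.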

Sometimes classes of endings can be eliminated on grammatical grounds. For
example, ness forms nouns from adjectives, and able forms adjectives from
verbs, so you would not expect them to attach to the same word. But there
are many exceptions to rules like this.

I think ending generation helps understand stemmers, but I'm not sure that
classes of endings are utilizable by IR systems, if only because there are
so many of them.

Martin

