[Snowball-discuss] RE: Snowball-discuss digest, Vol 1 #25 - 1 msg

Tolkin, Steve Steve.Tolkin@FMR.COM
Mon, 24 Dec 2001 09:39:42 -0500


Dear Martin,
	Thanks for this information.  I have a few comments.

1. The endings -er (and -ier) and -est (and -iest) for comparative and
superlative forms of adjectives and adverbs seem to be missing.  I suggest
they belong in the list of Inflexional endings.

2. You said:
> This is a minimum
> list: you can argue for other forms (ableness for example).
So I presume this list was created in a somewhat manual way.
Another advantage of having a generate mode for Snowball:
Any change to the Snowball code and/or rules for a language
could be automatically tested by comparing the new list of 
endings with the list before the change.  
This would be very useful for QA (quality assurance).
As you point out, if a generate mode is added then there also
needs to be a way to set the maximum number of suffixes that
may be chained together.
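
A minimal sketch of such a check, under stated assumptions: Snowball has
no generate mode, so generate_endings, join, and the follows table below
are hypothetical stand-ins rather than Snowball code.  The three rules
are the circular ones from Martin's message (ize + ation, ation + al,
al + ize), and max_suffixes is the cap that breaks the loop.

```python
def join(left, right):
    """Concatenate two suffixes, dropping a final "e" before a vowel.

    Toy spelling rule (an assumption): "ize" + "ation" -> "ization".
    """
    if left.endswith("e") and right[0] in "aeiou":
        left = left[:-1]
    return left + right


def generate_endings(follows, max_suffixes=4):
    """Enumerate compound endings built from at most max_suffixes suffixes.

    follows maps each suffix to the suffixes that may be attached after it.
    """
    endings = set()
    # Each frontier entry is (compound ending so far, last suffix added).
    frontier = [(s, s) for s in follows]
    for _ in range(max_suffixes):
        endings.update(ending for ending, _ in frontier)
        frontier = [
            (join(ending, nxt), nxt)
            for ending, last in frontier
            for nxt in follows.get(last, [])
        ]
    return sorted(endings)


def diff_endings(before, after):
    """Return (added, removed) endings between two generated lists."""
    old, new = set(before), set(after)
    return sorted(new - old), sorted(old - new)


# Hypothetical rule table: the circular example from the quoted message.
RULES = {"ize": ["ation"], "ation": ["al"], "al": ["ize"]}

before = generate_endings(RULES, max_suffixes=4)
# Simulate a rule change: "al" may now also be followed by "ism".
changed = dict(RULES, al=["ize", "ism"], ism=[])
after = generate_endings(changed, max_suffixes=4)

added, removed = diff_endings(before, after)
print("added:", added)
print("removed:", removed)
```

Without the cap the enumeration would never terminate on these rules,
which is why the maximum has to be a parameter of any generate mode.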

3. You said:
> I think ending generation helps understand stemmers, but I'm not sure that
> classes of endings are utilizable by IR systems, if only because there are
> so many of them.

But modern computers are really fast and have large main memories
compared to years ago.  I think a system could generate all these, 
and look them up even in a very large wordlist in < 0.01 second.
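
A rough sketch of that lookup, assuming the wordlist is held in memory as
a Python set so that each membership test takes constant time on average.
The function name find_variants, the short ENDINGS list, and the sample
WORDS are illustrative assumptions, not part of Snowball.

```python
def find_variants(stem, endings, wordlist):
    """Return every stem + ending that occurs in wordlist, sorted."""
    return sorted(stem + e for e in endings if stem + e in wordlist)


# A few endings taken from the quoted list, plus the empty ending
# so that the bare stem is also tried.
ENDINGS = ["", "ed", "ing", "ings", "s", "ion", "ions", "ive", "ively"]

# Tiny stand-in for a very large wordlist.
WORDS = {
    "connect", "connected", "connecting", "connection",
    "connections", "connective", "contact",
}

print(find_variants("connect", ENDINGS, WORDS))
```

The cost is one set lookup per ending, so even several hundred endings
per stem amount to only a few hundred hash probes.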

However, I agree that there are so many that it might be worthwhile
to try the reverse strategy, i.e. start from the dictionary and
test all the words that share the same first several letters with
the given word.  So my next question is whether there is a formula
for the maximum length of the common prefix.  Given a word, w,
we can find its stem, s, quickly.  Suppose the stem is of length
n.  Is there a formula, e.g. n-2, that ensures that all words
having the same stem as w will begin with the first n-2 characters
of s?  I suspect so.  Further, I suspect that this formula
may be made more efficient by a few extra tests, e.g. if the
stem ends with "i" use n-2, otherwise n-1.
(That's an example -- the real rules are probably somewhat more
complex.)  Given these rules, it might be faster to scan the
dictionary and then generate and test stems.
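
The reverse strategy might be sketched as follows.  Everything here is
an assumption made for illustration: toy_stem is a crude stand-in for a
real stemmer, and prefix_length encodes only the example rule given
above (n-2 if the stem ends with "i", otherwise n-1), which has not been
checked against the actual Porter rules.

```python
from bisect import bisect_left


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip one common ending."""
    for suffix in ("ings", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def prefix_length(stem):
    """Example rule from the text above; the real rule is unverified."""
    n = len(stem)
    return max(1, n - 2 if stem.endswith("i") else n - 1)


def words_with_prefix(sorted_words, prefix):
    """Collect all words starting with prefix from a sorted list."""
    matches = []
    i = bisect_left(sorted_words, prefix)  # first word >= prefix
    while i < len(sorted_words) and sorted_words[i].startswith(prefix):
        matches.append(sorted_words[i])
        i += 1
    return matches


def same_stem_words(word, sorted_words, stem=toy_stem):
    """Scan the dictionary by prefix, then keep words with the same stem."""
    s = stem(word)
    prefix = s[: prefix_length(s)]
    return [w for w in words_with_prefix(sorted_words, prefix) if stem(w) == s]


DICTIONARY = sorted([
    "conceal", "connect", "connected", "connecting",
    "connector", "connects", "consider",
])

print(same_stem_words("connecting", DICTIONARY))
```

The binary search locates the start of the prefix range in O(log N), so
the stemmer is run only on the handful of words that share the prefix
rather than on the whole dictionary.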

P.S.  Yes, keeping the name Snowball is fine.
I sent the earlier email so we would know about that
other project.
 
Hopefully helpfully yours,
Steve
-- 
Steven Tolkin          steve.tolkin@fmr.com      617-563-0516 
Fidelity Investments   82 Devonshire St. V1D     Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates.


> -----Original Message-----
> From: snowball-discuss-request@lists.sourceforge.net
> [mailto:snowball-discuss-request@lists.sourceforge.net]
> Sent: Sunday, December 23, 2001 3:28 PM
> To: snowball-discuss@lists.sourceforge.net
> Subject: Snowball-discuss digest, Vol 1 #25 - 1 msg
...
>
> Message: 1
> To: "Tolkin, Steve" <Steve.Tolkin@FMR.COM>
> From: martin_porter@softhome.net (Martin Porter)
> Subject: Re: [Snowball-discuss] Can snowball be run backwards 
> to generate words?
> Cc: snowball-discuss@lists.sourceforge.net
> Date: Sat, 22 Dec 2001 14:56:28 -0700
> 
> 
> You can turn the Porter stemmer inside out, and generate all endings that
> the stemmer will recognise, but there are several problems. One is that the
> endings go in circles, e.g.
> 
>    ize + ation as in realization
>    ation + al as in operational
>    al + ize as in normalize
> 
> - suggesting infinite endings izationalizational... You can break the loop
> by noting that four is the upper limit on the number of derivational
> suffixes that can be attached to a word in English.
> 
> If you do this, you end up with really quite a lot of endings. Here is a
> list I put together recently,
> 
> Inflexional: ed  ing  ings  s
> 
> Derivational:
>             ic          ioned       *ationed     *icationed
>     *izationed   *alizationed           ered        *izered
>      *alizered    *icalizered   *ionalizered           ated
>         icated           ized         alized      *icalized
>     *ionalized   *ationalized           ance           ence
>           able           ible            ate          icate
>            ive          ative        icative            ize
>          alize       *icalize      *ionalize    *ationalize
>         ioning      *ationing    *icationing    *izationing
>  *alizationing          ering       *izering     *alizering
>   *icalizering  *ionalizering          ating        icating
>          izing       *alizing     *icalizing    *ionalizing
>  *ationalizing             al           ical          ional
>        ational     *icational     *izational            ful
>            ism          alism       *icalism      *ionalism
>    *ationalism            ion          ation        ication
>        ization      alization             er           izer
>        *alizer      *icalizer     *ionalizer           ator
>            ics          ances          ences         ancies
>         encies          ities        icities        alities
>     *icalities     ionalities  *ationalities      abilities
>      ibilities       *ivities     *ativities   *icativities
>          ables          ibles         nesses     *ivenesses
>   *ativenesses *icativenesses      *alnesses    *icalnesses
>   *ionalnesses *ationalnesses     *fulnesses     *ousnesses
>           ates         icates           ives         atives
>      *icatives           izes        *alizes      *icalizes
>     *ionalizes   *ationalizes            als          icals
>         ionals      *ationals    *icationals    *izationals
>           isms        *alisms      *icalisms     *ionalisms
>   *ationalisms           ions         ations       ications
>       izations    *alizations            ers          izers
>       *alizers     *icalizers    *ionalizers          ators
>           ness        iveness     *ativeness   *icativeness
>         alness      *icalness      ionalness   *ationalness
>        fulness        ousness           ants           ents
>          ments         ements            ous            ant
>            ent           ment          ement           ancy
>           ency             ly           ably           ibly
>          ately       *icately          ively        atively
>     *icatively           ally         ically        ionally
>      ationally          ously          ently        *mently
>       *emently            ity          icity          ality
>        icality       ionality    *ationality        ability
>        ibility          ivity       *ativity     *icativity
> 
> - sorted by ending and arranged in 4 columns. The endings marked * are very
> rare or non-existent and could be ignored. There are some extra rules:
> endings beginning ion should follow s or t in the stem. This is a minimum
> list: you can argue for other forms (ableness for example).
> 
> If a word is se, where s is the stem and e the ending, looking up all the s*
> where * is any of these endings could be quite expensive therefore.
> 
> Sometimes classes of endings can be eliminated on grammatical grounds. For
> example, ness forms nouns from adjectives, and able forms adjectives from
> nouns, so you would not expect them to attach to the same word. But there
> are many exceptions to rules like this.
> 
> I think ending generation helps understand stemmers, but I'm not sure that
> classes of endings are utilizable by IR systems, if only because there are
> so many of them.
> 
> Martin
> 
> 
> 
> 
> --__--__--
> 
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/snowball-discuss
> 
> 
> End of Snowball-discuss Digest
> 
