[Snowball-discuss] Words' list relating to word containing wildcard *, ?, #
Martin Porter
martin.f.porter at gmail.com
Fri Nov 6 18:25:25 GMT 2020
Eric,
It is possible to generate a list of all possible endings. Suppose the
set of all endings is E. It is also possible to find on the internet
almost-exhaustive wordlists for the commonest languages. Suppose L is
such a list. If you have a form w*, where w is a string of characters
and * is supposed to be some valid suffix, you look at all the words
in L that begin with w, and collect those that end with a string that
belongs to E. Then this set (with some small margin of error) is the
valid "expansion set" for w.
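
The procedure above can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the function name expansion_set
and the toy contents of E and L are invented here, and "ends with a string
that belongs to E" is read as "what remains of the word after w is a member
of E". The empty ending is included in E so that w itself counts as a match.

```python
def expansion_set(w, wordlist, endings):
    """Return the words in wordlist that start with w and whose
    remainder after w is one of the given endings."""
    endings = set(endings)
    return sorted(
        word for word in wordlist
        if word.startswith(w) and word[len(w):] in endings
    )

# Toy data, invented for illustration only. A real E would come from a
# suffix inventory and a real L from a large wordlist.
E = {"", "s", "ed", "ing", "ive", "er"}
L = ["support", "supports", "supported", "supporting",
     "supportive", "supper", "suppose"]

print(expansion_set("support", L, E))
# ['support', 'supported', 'supporting', 'supportive', 'supports']
```

Words such as "supper" are rejected because they do not begin with w, and
any word that begins with w but continues with something outside E would
be rejected by the second condition.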
See appendices 3 and 4 of
http://snowball.tartarus.org/algorithms/lovins/festschrift.html for
creating the list E for English. Unfortunately this is for English,
not French.
But I wonder if there is not a much easier approach. I seem to recall
that the multilingual spellcheckers in OpenOffice contain stems
plus suffix lists in the form
support/s/ing/ive/er...
This could be a simpler starting point. Snowball itself is not easily
used in reverse, that is, to generate from a stem or word w all valid
word forms that will stem to w.
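
If a dictionary really were available in the slash-separated form recalled
above, expanding an entry would be simple. The sketch below assumes exactly
that format (a stem followed by literal suffixes, each separated by "/");
the function name expand_entry is hypothetical, and actual spellchecker
dictionaries may well be encoded differently.

```python
def expand_entry(entry):
    """Expand 'stem/suf1/suf2/...' into the stem plus stem+suffix forms.
    Assumes every field after the first is a literal suffix."""
    stem, *suffixes = entry.split("/")
    return [stem] + [stem + suffix for suffix in suffixes]

print(expand_entry("support/s/ing/ive/er"))
# ['support', 'supports', 'supporting', 'supportive', 'supporter']
```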
I am not sure how you might handle forms *w or *w*. And for English at
least, there is the problem that the stemming process may cause
respelling in the stem. Thus support* includes supports, but gantry*
(a kind of support) does not include gantries.
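
The respelling problem is easy to see with a plain prefix match. The
snippet below is only a demonstration with an invented four-word list and
a hypothetical helper prefix_matches; it does not attempt to solve the
problem.

```python
def prefix_matches(w, wordlist):
    """Return the words in wordlist that literally begin with w."""
    return [word for word in wordlist if word.startswith(w)]

# Toy wordlist, invented for illustration only.
words = ["support", "supports", "gantry", "gantries"]

print(prefix_matches("support", words))  # ['support', 'supports']
print(prefix_matches("gantry", words))   # ['gantry']
```

The plural "gantries" is missed because the final y of the stem is respelt
as i before the ending, so the word no longer begins with "gantry".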
Martin