[Snowball-discuss] Relation/table of stems

John Gage jsmgage at gmail.com
Thu Feb 25 07:09:31 GMT 2010


In his article "Snowball: A language for stemming algorithms"
<http://snowball.tartarus.org/texts/introduction.html>, Porter states:

"*More flexibility however is obtained by indexing all words in a text in an
unstemmed form, and keeping a separate two-column relation which connects
the words to their stemmed equivalents. The relation can be denoted by R(s,
w), which means that s is the stemmed form of word w. From the relation we
can get, for any word w, its unique stemmed form, stem(w), and for any stem
s, the set of words, words(s), that stem to s.*"
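
To make the quoted definition concrete, here is a minimal sketch of the relation R(s, w) held in memory, with stem(w) and words(s) as two lookups. The function names and the toy suffix-stripper are assumptions for illustration only; the toy stemmer is NOT Porter's algorithm, and a real Snowball stemmer would be passed in as stem_fn instead.

```python
from collections import defaultdict


def toy_stem(word):
    """Placeholder stemmer (NOT Porter's algorithm): strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_relation(words, stem_fn):
    """Build both directions of R(s, w) for the given words.

    Returns (stem_of, words_of), where stem_of[w] is stem(w) and
    words_of[s] is the set words(s).
    """
    stem_of = {}
    words_of = defaultdict(set)
    for w in words:
        s = stem_fn(w)
        stem_of[w] = s          # each word has exactly one stem
        words_of[s].add(w)      # each stem collects all words that map to it
    return stem_of, dict(words_of)


stem_of, words_of = build_relation(
    ["connect", "connected", "connecting", "connects"], toy_stem
)
```

With these four words, every entry in stem_of maps to "connect", and words_of["connect"] holds all four forms.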

My interest is in such a two-column relation (table) for all the words in
the English language, or at least for all the words that appear in an open
source word list available on the Internet.

There are several lists available, but the one that stands out for me for
the moment is Grady Ward's list <http://icon.shef.ac.uk/Moby/mwords.html>.

My interest is even narrower, because what I would like to do is store that
relation/table as a table in a PostgreSQL database on my computer.

Has any of this already been done?  Shouldn't there be tables of words and
their stems out there already?  I have searched and not found them.  I
suppose lists of "lemmas" might serve a similar purpose, but once again, I
have searched and found nothing.

On the other hand, what would it take to create such a list?
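
One possible route, sketched under assumptions: read the word list one entry per line, run each word through a stemmer, and write a two-column CSV file that PostgreSQL can bulk-load. The table name word_stem, the column names, the file name in the comment, and the placeholder stemmer are all hypothetical; a real Snowball English stemmer would be passed in as stem_fn.

```python
import csv
import io

# Hypothetical table definition; the names are illustrative only.
CREATE_TABLE_SQL = """
CREATE TABLE word_stem (
    word text PRIMARY KEY,
    stem text NOT NULL
);
CREATE INDEX word_stem_stem_idx ON word_stem (stem);
"""


def placeholder_stem(word):
    """Stand-in for a real stemmer: drops one trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def write_relation(lines, stem_fn, out):
    """Write one 'word,stem' CSV row per distinct word; return the row count."""
    writer = csv.writer(out, lineterminator="\n")
    seen = set()
    count = 0
    for raw in lines:
        word = raw.strip().lower()
        if not word or word in seen:
            continue            # skip blank lines and duplicates
        seen.add(word)
        writer.writerow([word, stem_fn(word)])
        count += 1
    return count


# Demo on an in-memory buffer; in practice `lines` would be an open word-list
# file and `out` a file such as word_stem.csv, loaded afterwards from psql:
#   \copy word_stem FROM 'word_stem.csv' WITH (FORMAT csv)
buf = io.StringIO()
n = write_relation(["Running\n", "runs\n", "runs\n", "\n"], placeholder_stem, buf)
```

The primary key on word reflects that each word has a unique stem, and the index on stem is there so that looking up all the words for a given stem does not have to scan the whole table.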

Sorry to be such a novice and so helpless.