[Snowball-discuss] Relation/table of stems

tedivm tedivm at tedivm.com
Thu Feb 25 07:28:27 GMT 2010


Most Unix distributions come with a giant word list, used by spell checkers. The format is pretty simple: one word per line. OS X ships with two versions of Webster's, a list of proper names, and a list of stop words, all located in /usr/share/dict/.

From there it should only take a small script to accomplish what you need. Assuming you have two tables, one storing stems and one storing words, the pseudocode would look like this:

for each line, assign to Word
	build Stem from Word
	insert/replace Stem, retrieving its ID
	insert Word and Stem ID
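
As a rough illustration, the loop above could be sketched in Python along these lines. Several things here are assumptions and not part of the original suggestion: the table names `stems` and `words`, the helper names, and the use of SQLite (chosen only so the sketch runs with the standard library; the question was about PostgreSQL, where `INSERT ... ON CONFLICT DO NOTHING` would play the role of `INSERT OR IGNORE`). The `toy_stem` function is a placeholder and not a real stemming algorithm; a real stemmer, for example `snowballstemmer.stemmer('english').stemWord` from the `snowballstemmer` package, could be passed in instead.

```python
import sqlite3


def toy_stem(word):
    """Placeholder stemmer (NOT Porter/Snowball): drops one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def load_words(conn, lines, stem=toy_stem):
    """Fill the hypothetical stems/words tables from one-word-per-line input."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stems ("
        "id INTEGER PRIMARY KEY, stem TEXT UNIQUE NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "word TEXT PRIMARY KEY, "
        "stem_id INTEGER NOT NULL REFERENCES stems(id))"
    )
    for line in lines:
        word = line.strip()
        if not word:
            continue  # skip blank lines
        s = stem(word.lower())
        # Insert the stem only if it is new, then look up its ID.
        conn.execute("INSERT OR IGNORE INTO stems (stem) VALUES (?)", (s,))
        (stem_id,) = conn.execute(
            "SELECT id FROM stems WHERE stem = ?", (s,)
        ).fetchone()
        # Store the word together with the ID of its stem.
        conn.execute(
            "INSERT OR REPLACE INTO words (word, stem_id) VALUES (?, ?)",
            (word, stem_id),
        )
    conn.commit()


def words_for_stem(conn, s):
    """Return all stored words whose stem is s, sorted alphabetically."""
    rows = conn.execute(
        "SELECT w.word FROM words w JOIN stems st ON st.id = w.stem_id "
        "WHERE st.stem = ? ORDER BY w.word",
        (s,),
    )
    return [row[0] for row in rows]


# Small in-memory demonstration; for a real run, pass an open word-list file.
demo = sqlite3.connect(":memory:")
load_words(demo, ["cat\n", "cats\n", "dog\n"])
print(words_for_stem(demo, "cat"))  # ['cat', 'cats']
```

The stem is inserted with "ignore if present" rather than a true replace, so that an existing stem keeps its ID and rows already pointing at it stay valid.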
	
If you wanted to make it more portable, you could build a text file with each word and its stem on the same line, separated by a tab or some other delimiter. I know there are a few lists like that already on the Snowball site, and they're extremely useful for testing (when I made a PHP port of Porter2, they were ridiculously helpful).
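
A minimal sketch of that portable variant, writing one "word, tab, stem" pair per line. The function name `write_word_stem_pairs` is made up for illustration, and `toy_stem` is again only a placeholder, not a real stemming algorithm; any callable that maps a word to its stem could be supplied in its place.

```python
import io


def toy_stem(word):
    """Placeholder stemmer (NOT Porter/Snowball): drops one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def write_word_stem_pairs(lines, out, stem=toy_stem):
    """Write 'word<TAB>stem' lines to out; return the number of pairs written."""
    count = 0
    for line in lines:
        word = line.strip()
        if not word:
            continue  # skip blank lines
        out.write(f"{word}\t{stem(word.lower())}\n")
        count += 1
    return count


# Demonstration with an in-memory buffer instead of real files.
buffer = io.StringIO()
write_word_stem_pairs(["cat\n", "cats\n", "dog\n"], buffer)
print(buffer.getvalue(), end="")
```

For a real run, `lines` could be an open word-list file and `out` a file opened for writing; the resulting text file can then be loaded into any database or diffed against another stemmer's output.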

Robert




On Feb 25, 2010, at 2:09 AM , John Gage wrote:

> In his article, "Snowball: A language for stemming algorithms", Porter states,
> 
> "More flexibility however is obtained by indexing all words in a text in an unstemmed form, and keeping a separate two-column relation which connects the words to their stemmed equivalents. The relation can be denoted by R(s, w), which means that s is the stemmed form of word w. From the relation we can get, for any word w, its unique stemmed form, stem(w), and for any stem s, the set of words, words(s), that stem to s."
> 
> My interest is in such a two-column relation (table) for all the words in the English language, all the words in the English language, that is, that are available in an open source list available on the Internet.
> 
> There are several lists available, but the one that stands out for me for the moment is Grady Ward's list.
> 
> My interest is even narrower, because what I would like to do is store that relation/table as a table in a PostgreSQL database on my computer.
> 
> I wonder if any of this has already been done?  Shouldn't there be tables of words and their stems out there already?  I have searched and not found them.  I suppose "lemmas" might do something similar, but once again, I have searched and not found.
> 
> On the other hand, what would it take to create such a list?
> 
> Sorry to be such a novice and so helpless.
