[Snowball-discuss] Relation/table of stems

John Gage jsmgage at numericable.fr
Thu Feb 25 21:14:57 GMT 2010


Martin and Robert,

I am afraid that my level of ignorance exceeds your worst  
expectations, so to speak.

In the first place, thank you very much for your replies.  They are  
completely on target.  I have found the lists mentioned by Robert and,  
yes, they are precisely what I want.  I have downloaded the word list  
mentioned by Martin, and it is exceptional.

But I am not even remotely a programmer.  To be honest, I am  
moderately stumped by the pseudo-code.  What would be marvelous would  
be an executable that I could run to get the result I need.  If there  
is a script language version that I could run unchanged or nearly  
unchanged, that would be great.

I am truly sorry to come on the scene unprepared, but this is  
something I really, really want.

Tell me how I can help myself, please.

Thanking you,

John Gage



On Feb 25, 2010, at 8:28 AM, tedivm wrote:

>
> Most Unix distros come with a giant word list, used for spell
> checkers. The format is simple: one word per line. OS X ships
> with two versions of Webster's, a list of proper names, and a list of
> stop words, all located in /usr/share/dict/
>
> From there it should only take a small script to accomplish what you
> need. Assuming you have two tables, one storing stems and one
> storing words, the pseudo-code would look like this:
>
> for each line in the word list, assign it to Word
> 	build Stem from Word
> 	insert/replace Stem in the stems table, retrieving its ID
> 	insert Word and the Stem ID into the words table
> 	
> If you wanted to make it more portable you could build out a text
> file with the word and its stem on the same line, with a tab
> delimiter or something between them. I know there are a few lists
> like that already on the Snowball site, and they're extremely useful
> for testing (when I made a PHP port of Porter2 they were ridiculously
> helpful).
>
> Robert
>
>
>
>
> On Feb 25, 2010, at 2:09 AM, John Gage wrote:
>
>> In his article, "Snowball: A language for stemming algorithms",  
>> Porter states,
>>
>> "More flexibility however is obtained by indexing all words in a  
>> text in an unstemmed form, and keeping a separate two-column  
>> relation which connects the words to their stemmed equivalents. The  
>> relation can be denoted by R(s, w), which means that s is the  
>> stemmed form of word w. From the relation we can get, for any word  
>> w, its unique stemmed form, stem(w), and for any stem s, the set of  
>> words, words(s), that stem to s."
>>
>> My interest is in such a two-column relation (table) for all the
>> words in the English language, or rather all those that appear in
>> an open source word list available on the Internet.
>>
>> There are several lists available, but the one that stands out for  
>> me for the moment is Grady Ward's list.
>>
>> My interest is even narrower, because what I would like to do is  
>> store that relation/table as a table in a PostgreSQL database on my  
>> computer.
>>
>> I wonder if any of this has already been done?  Shouldn't there be  
>> tables of words and their stems out there already?  I have searched  
>> and not found them.  I suppose "lemmas" might do something similar,  
>> but once again, I have searched and not found.
>>
>> On the other hand, what would it take to create such a list?
>>
>> Sorry to be such a novice and so helpless.
>> _______________________________________________
>> Snowball-discuss mailing list
>> Snowball-discuss at lists.tartarus.org
>> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>
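
For readers following the thread: the word-and-stem text file Robert
describes, together with Porter's relation R(s, w), might be sketched in
Python roughly as below. This is only an illustration under stated
assumptions. The names toy_stem, build_relation, write_tsv and
index_relation are hypothetical, and toy_stem is a crude placeholder, not
the Porter or Snowball algorithm; a real stemmer would be passed in its
place.

```python
from collections import defaultdict


def toy_stem(word):
    """Placeholder stemmer (an assumption for illustration only).

    It strips a few common suffixes and is NOT the Porter/Snowball
    algorithm; pass a real stemming function to get real stems.
    """
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_relation(lines, stem):
    """Return the relation R(s, w) as a sorted list of (stem, word) pairs.

    `lines` is any iterable of strings, one word per line, such as an
    open word-list file.
    """
    pairs = set()
    for line in lines:
        word = line.strip().lower()
        if word:  # skip blank lines
            pairs.add((stem(word), word))
    return sorted(pairs)


def write_tsv(pairs, out):
    """Write one 'word<TAB>stem' line per pair to a text file object."""
    for s, w in pairs:
        out.write(f"{w}\t{s}\n")


def index_relation(pairs):
    """Build the two lookups Porter describes: stem(w) and words(s)."""
    stem_of = {}
    words_of = defaultdict(list)
    for s, w in pairs:
        stem_of[w] = s
        words_of[s].append(w)
    return stem_of, dict(words_of)
```

A possible use would be to open a word list (for instance a file under
/usr/share/dict/, as Robert mentions), pass the file object to
build_relation, and write the result with write_tsv. Tab-delimited text
is also the default text format that PostgreSQL's COPY command reads, so
such a file could probably be loaded into a two-column table without
further conversion. If the third-party snowballstemmer package for
Python is installed, its stemmer("english").stemWord method could
presumably be passed instead of toy_stem.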

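Robert's two-table pseudo-code (one table of stems, one table of words
pointing at a stem ID) could look something like the following. This
sketch uses Python's built-in sqlite3 module as a stand-in for
PostgreSQL, purely so that it runs without a database server; the table
names, column names and the functions load_words and words_for_stem are
assumptions, and toy_stem is again a placeholder rather than a real
Snowball stemmer.

```python
import sqlite3


def toy_stem(word):
    """Placeholder stemmer for illustration; NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def load_words(conn, lines, stem):
    """Fill a stems table and a words table from one-word-per-line input."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS stems (
            id   INTEGER PRIMARY KEY,
            stem TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS words (
            word    TEXT PRIMARY KEY,
            stem_id INTEGER NOT NULL REFERENCES stems(id)
        );
        """
    )
    for line in lines:
        word = line.strip().lower()
        if not word:
            continue
        s = stem(word)
        # Insert the stem if it is new, then retrieve its ID.
        conn.execute("INSERT OR IGNORE INTO stems (stem) VALUES (?)", (s,))
        (stem_id,) = conn.execute(
            "SELECT id FROM stems WHERE stem = ?", (s,)
        ).fetchone()
        # Insert the word together with the ID of its stem.
        conn.execute(
            "INSERT OR REPLACE INTO words (word, stem_id) VALUES (?, ?)",
            (word, stem_id),
        )
    conn.commit()


def words_for_stem(conn, s):
    """Return all stored words that stem to s, in alphabetical order."""
    rows = conn.execute(
        "SELECT w.word FROM words w JOIN stems st ON st.id = w.stem_id "
        "WHERE st.stem = ? ORDER BY w.word",
        (s,),
    )
    return [row[0] for row in rows]
```

With conn = sqlite3.connect(":memory:") and a word list passed to
load_words, words_for_stem(conn, "walk") would return the stored words
whose stem is "walk". The same table layout should carry over to
PostgreSQL, though the "INSERT OR IGNORE" and "INSERT OR REPLACE" forms
are SQLite syntax and would need PostgreSQL equivalents.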