[Snowball-discuss] Relation/table of stems ERROR

John Gage jsmgage at numericable.fr
Tue Jul 20 12:39:20 BST 2010


This was re-sent in error.  My apologies.

John Gage


On Feb 25, 2010, at 10:14 PM, John Gage wrote:

> Martin and Robert,
>
> I am afraid that my level of ignorance exceeds your worst  
> expectations, so to speak.
>
> In the first place, thank you very much for your replies.  They are  
> completely on target.  I have found the lists mentioned by Robert  
> and, yes, they are precisely what I want.  I have downloaded the  
> word list mentioned by Martin, and it is exceptional.
>
> But I am not even remotely a programmer.  To be honest, I am  
> moderately stumped by the pseudo-code.  What would be marvelous  
> would be an executable that I could run to get the result I need.   
> If there is a script language version that I could run unchanged or  
> nearly unchanged, that would be great.
>
> I am truly sorry to come on the scene unprepared, but this is  
> something I really, really want.
>
> Tell me how I can help myself, please.
>
> Thanking you,
>
> John Gage
>
>
>
> On Feb 25, 2010, at 8:28 AM, tedivm wrote:
>
>>
>> Most Unix distros come with a giant word list, used for spell  
>> checkers. The format is pretty simple: one word per line. OS X ships  
>> with two versions of Webster's, a list of proper names, and a list  
>> of stop words, all located in /usr/share/dict/
>>
>> From there it should only take a small script to accomplish what  
>> you need. Assuming you have two tables, one storing stems and one  
>> storing words, the pseudo code would look like this:
>>
>> for each line in the word list, assign it to Word
>> 	build Stem from Word
>> 	insert or replace Stem in the stems table, retrieving its ID
>> 	insert Word and the Stem ID into the words table
>> 	
>> If you wanted to make it more portable, you could build out a text  
>> file with the word and its stem on the same line, with a tab  
>> delimiter or something between them. I know there are a few lists  
>> like that already on the Snowball site, and they're extremely  
>> useful for testing (when I made a PHP port of Porter2 they were  
>> ridiculously helpful).
>>
>> Robert
>>
>>
>>
>>
>> On Feb 25, 2010, at 2:09 AM , John Gage wrote:
>>
>>> In his article, "Snowball: A language for stemming algorithms",  
>>> Porter states,
>>>
>>> "More flexibility however is obtained by indexing all words in a  
>>> text in an unstemmed form, and keeping a separate two-column  
>>> relation which connects the words to their stemmed equivalents.  
>>> The relation can be denoted by R(s, w), which means that s is the  
>>> stemmed form of word w. From the relation we can get, for any word  
>>> w, its unique stemmed form, stem(w), and for any stem s, the set  
>>> of words, words(s), that stem to s."
>>>
>>> My interest is in such a two-column relation (table) for all the  
>>> words in the English language, or rather all those that appear in  
>>> an open source word list available on the Internet.
>>>
>>> There are several lists available, but the one that stands out for  
>>> me for the moment is Grady Ward's list.
>>>
>>> My interest is even narrower, because what I would like to do is  
>>> store that relation/table as a table in a PostgreSQL database on  
>>> my computer.
>>>
>>> I wonder if any of this has already been done?  Shouldn't there be  
>>> tables of words and their stems out there already?  I have  
>>> searched and not found them.  I suppose "lemmas" might do  
>>> something similar, but once again, I have searched and not found.
>>>
>>> On the other hand, what would it take to create such a list?
>>>
>>> Sorry to be such a novice and so helpless.
>>> _______________________________________________
>>> Snowball-discuss mailing list
>>> Snowball-discuss at lists.tartarus.org
>>> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>>
>
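
For anyone finding this thread later, the pseudo code in Robert's reply and the R(s, w) relation from Porter's article might be sketched roughly as below. This is only an illustration under several assumptions: it uses SQLite from the Python standard library instead of PostgreSQL so that it runs without a server, the table and function names (stems, words, build_relation, stem_of, words_of) are made up here, and toy_stem is a crude placeholder, not the Porter2 algorithm. A real Snowball stemmer would be passed in where toy_stem is used.

```python
import sqlite3
from typing import Callable, Iterable, List, Optional


def toy_stem(word: str) -> str:
    """Placeholder stemmer (NOT Porter2): strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_relation(conn: sqlite3.Connection,
                   words: Iterable[str],
                   stem: Callable[[str], str]) -> None:
    """Fill a stems table and a words table, one input word per line."""
    conn.execute("CREATE TABLE IF NOT EXISTS stems ("
                 "id INTEGER PRIMARY KEY, stem TEXT UNIQUE NOT NULL)")
    conn.execute("CREATE TABLE IF NOT EXISTS words ("
                 "word TEXT PRIMARY KEY, "
                 "stem_id INTEGER NOT NULL REFERENCES stems(id))")
    for line in words:
        word = line.strip()
        if not word:
            continue  # skip blank lines
        s = stem(word)
        # Insert the stem if it is new, then retrieve its ID.
        conn.execute("INSERT OR IGNORE INTO stems (stem) VALUES (?)", (s,))
        (stem_id,) = conn.execute(
            "SELECT id FROM stems WHERE stem = ?", (s,)).fetchone()
        # Insert the word together with the ID of its stem.
        conn.execute("INSERT OR REPLACE INTO words (word, stem_id) "
                     "VALUES (?, ?)", (word, stem_id))
    conn.commit()


def stem_of(conn: sqlite3.Connection, word: str) -> Optional[str]:
    """stem(w): the unique stem stored for a word, or None if unknown."""
    row = conn.execute(
        "SELECT s.stem FROM words w JOIN stems s ON s.id = w.stem_id "
        "WHERE w.word = ?", (word,)).fetchone()
    return row[0] if row else None


def words_of(conn: sqlite3.Connection, stem: str) -> List[str]:
    """words(s): all stored words that stem to s, in alphabetical order."""
    rows = conn.execute(
        "SELECT w.word FROM words w JOIN stems s ON s.id = w.stem_id "
        "WHERE s.stem = ? ORDER BY w.word", (stem,))
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
sample = ["connect", "connected", "connecting", "connects", "table"]
build_relation(conn, sample, toy_stem)
print(stem_of(conn, "connecting"))
print(words_of(conn, "connect"))
```

To run this over a real word list, an open file such as /usr/share/dict/words could be passed as the words argument, since iterating over a file yields one line at a time. The same two-table layout should carry over to PostgreSQL, although the insert-or-ignore statements would need that database's own syntax.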



