[Snowball-discuss] Relation/table of stems
tedivm
tedivm at tedivm.com
Thu Feb 25 21:31:20 GMT 2010
John,
If you're not a programmer, what are you using this list of words for? I'm just trying to understand your ultimate goal with this.
Robert
On Feb 25, 2010, at 4:14 PM, John Gage wrote:
> Martin and Robert,
>
> I am afraid that my level of ignorance exceeds your worst expectations, so to speak.
>
> In the first place, thank you very much for your replies. They are completely on target. I have found the lists mentioned by Robert and, yes, they are precisely what I want. I have downloaded the word list mentioned by Martin, and it is exceptional.
>
> But I am not even remotely a programmer. To be honest, I am moderately stumped by the pseudo-code. What would be marvelous would be an executable that I could run to get the result I need. If there is a script language version that I could run unchanged or nearly unchanged, that would be great.
>
> I am truly sorry to come on the scene unprepared, but this is something I really, really want.
>
> Tell me how I can help myself, please.
>
> Thanking you,
>
> John Gage
>
>
>
> On Feb 25, 2010, at 8:28 AM, tedivm wrote:
>
>>
>> Most Unix distros come with a giant word list, used for spell checkers. The format is pretty simple: one word per line. OS X ships with two versions of Webster's, a list of proper names, and a list of stop words, all located in /usr/share/dict/
>>
>> From there it should only take a small script to accomplish what you need. Assuming you have two tables, one storing stems and one storing words, the pseudocode would look like this:
>>
>> for each line, assign it to Word
>>     build Stem from Word
>>     insert/replace Stem, retrieving its ID
>>     insert Word with the Stem ID
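
The pseudocode above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: SQLite (through the standard `sqlite3` module) stands in for whatever database is actually used, the table and column names are made up for the example, and `toy_stem` is a hypothetical placeholder, not a Snowball stemmer.

```python
import sqlite3


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer; this is NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_tables(conn, lines, stem=toy_stem):
    """Fill a stems table and a words table, linking each word to its stem's id."""
    # Table and column names are illustrative assumptions.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stems (id INTEGER PRIMARY KEY, stem TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words "
        "(word TEXT PRIMARY KEY, stem_id INTEGER REFERENCES stems(id))"
    )
    for line in lines:
        word = line.strip()
        if not word:
            continue  # skip blank lines in the word list
        s = stem(word)
        # Insert the stem only if it is not already present, then look up its id.
        conn.execute("INSERT OR IGNORE INTO stems (stem) VALUES (?)", (s,))
        stem_id = conn.execute(
            "SELECT id FROM stems WHERE stem = ?", (s,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR REPLACE INTO words (word, stem_id) VALUES (?, ?)",
            (word, stem_id),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
build_tables(conn, ["jumps", "jumped", "jumping", "jump"])
```

Any iterable of lines works as input, so an open word-list file could be passed in place of the sample list, and a real stemming function could be passed as `stem`.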
>>
>> If you wanted to make it more portable you could build out a text file with the word and its stem on the same line, with a tab delimiter or something between them. I know there are a few lists like that already on the Snowball site, and they're extremely useful for testing (when I made a PHP port of Porter2 it was ridiculously helpful).
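
A tab-delimited file of that kind might be produced along these lines. This is a sketch only: `write_pairs` and `toy_stem` are hypothetical names, and `toy_stem` is a crude suffix stripper standing in for a real stemmer.

```python
import io


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer; this is NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def write_pairs(lines, out, stem=toy_stem):
    """Write 'word<TAB>stem' for every non-blank input line; return the count."""
    count = 0
    for line in lines:
        word = line.strip()
        if word:
            out.write(f"{word}\t{stem(word)}\n")
            count += 1
    return count


# An in-memory buffer keeps the example self-contained; a file opened
# for writing could be passed as `out` instead.
buf = io.StringIO()
written = write_pairs(["walking\n", "walks\n", "\n"], buf)
```

Because each line holds the word first and its stem second, the file can later be loaded into a two-column table or compared against another stemmer's output.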
>>
>> Robert
>>
>>
>>
>>
>> On Feb 25, 2010, at 2:09 AM, John Gage wrote:
>>
>>> In his article, "Snowball: A language for stemming algorithms", Porter states,
>>>
>>> "More flexibility however is obtained by indexing all words in a text in an unstemmed form, and keeping a separate two-column relation which connects the words to their stemmed equivalents. The relation can be denoted by R(s, w), which means that s is the stemmed form of word w. From the relation we can get, for any word w, its unique stemmed form, stem(w), and for any stem s, the set of words, words(s), that stem to s."
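
The relation R(s, w) described in the quotation can be pictured as a pair of lookups, one for stem(w) and one for words(s). The sketch below is illustrative only: `build_relation` is a hypothetical name, and the stems come from a small hand-written table rather than from a stemming algorithm.

```python
from collections import defaultdict

# Hand-written word -> stem pairs for illustration; a real stemmer would compute these.
HAND_STEMS = {
    "connect": "connect",
    "connected": "connect",
    "connecting": "connect",
    "connection": "connect",
}


def build_relation(words, stem):
    """Return (stem_of, words_of): stem(w) for each word and words(s) for each stem."""
    stem_of = {}
    words_of = defaultdict(set)
    for w in words:
        s = stem(w)
        stem_of[w] = s        # each word has exactly one stem
        words_of[s].add(w)    # each stem may have many words
    return stem_of, words_of


stem_of, words_of = build_relation(HAND_STEMS, HAND_STEMS.get)
```

Here `stem_of["connected"]` plays the role of stem(w), and `words_of["connect"]` plays the role of words(s).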
>>>
>>> My interest is in such a two-column relation (table) for all the words in the English language, or more precisely all the words that appear in an open source list available on the Internet.
>>>
>>> There are several lists available, but the one that stands out for me for the moment is Grady Ward's list.
>>>
>>> My interest is even narrower, because what I would like to do is store that relation/table as a table in a PostgreSQL database on my computer.
>>>
>>> I wonder if any of this has already been done? Shouldn't there be tables of words and their stems out there already? I have searched and not found them. I suppose "lemmas" might do something similar, but once again, I have searched and not found.
>>>
>>> On the other hand, what would it take to create such a list?
>>>
>>> Sorry to be such a novice and so helpless.
>>> _______________________________________________
>>> Snowball-discuss mailing list
>>> Snowball-discuss at lists.tartarus.org
>>> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>>
>