[Snowball-discuss] Relation/table of stems ERROR
John Gage
jsmgage at numericable.fr
Tue Jul 20 12:39:20 BST 2010
This was re-sent in error. My apologies.
John Gage
On Feb 25, 2010, at 10:14 PM, John Gage wrote:
> Martin and Robert,
>
> I am afraid that my level of ignorance exceeds your worst
> expectations, so to speak.
>
> In the first place, thank you very much for your replies. They are
> completely on target. I have found the lists mentioned by Robert
> and, yes, they are precisely what I want. I have downloaded the
> word list mentioned by Martin, and it is exceptional.
>
> But I am not even remotely a programmer. To be honest, I am
> moderately stumped by the pseudo-code. What would be marvelous
> would be an executable that I could run to get the result I need.
> If there is a script language version that I could run unchanged or
> nearly unchanged, that would be great.
>
> I am truly sorry to come on the scene unprepared, but this is
> something I really, really want.
>
> Tell me how I can help myself, please.
>
> Thanking you,
>
> John Gage
>
>
>
> On Feb 25, 2010, at 8:28 AM, tedivm wrote:
>
>>
>> Most Unix distros come with a giant word list, used for spell
>> checkers. The format is pretty simple: one word per line. OS X ships
>> with two versions of Webster's, a list of proper names, and a list
>> of stop words, all located in /usr/share/dict/.
>>
>> From there it should only take a small script to accomplish what
>> you need. Assuming you have two tables, one storing stems and one
>> storing words, the pseudo-code would look like this:
>>
>> for each line, assign it to Word
>>     build Stem from Word
>>     insert/replace Stem, retrieving its ID
>>     insert Word together with the Stem ID
>>
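The pseudo-code above could be sketched in Python roughly as follows. This is only a sketch under several assumptions: it uses the standard-library sqlite3 module as a stand-in for PostgreSQL, the table and column names (stems, words, stem_id) are made up for illustration, and toy_stem is a placeholder that merely strips a trailing "s". A real Snowball stemmer would be passed in as the stem argument instead.

```python
import sqlite3
from typing import Callable, Iterable


def toy_stem(word: str) -> str:
    """Placeholder only, NOT a real stemmer: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def load_words(conn: sqlite3.Connection,
               lines: Iterable[str],
               stem: Callable[[str], str] = toy_stem) -> None:
    """Fill a stems table and a words table from one-word-per-line input."""
    # Hypothetical schema: one row per distinct stem, one row per word.
    conn.execute("CREATE TABLE IF NOT EXISTS stems ("
                 "id INTEGER PRIMARY KEY, stem TEXT UNIQUE NOT NULL)")
    conn.execute("CREATE TABLE IF NOT EXISTS words ("
                 "word TEXT PRIMARY KEY, "
                 "stem_id INTEGER NOT NULL REFERENCES stems(id))")
    for line in lines:
        word = line.strip()          # for each line, assign it to Word
        if not word:
            continue                 # skip blank lines
        s = stem(word)               # build Stem
        # Insert the stem if it is new, then retrieve its ID.
        conn.execute("INSERT OR IGNORE INTO stems (stem) VALUES (?)", (s,))
        (stem_id,) = conn.execute(
            "SELECT id FROM stems WHERE stem = ?", (s,)).fetchone()
        # Insert the word together with the stem ID.
        conn.execute(
            "INSERT OR REPLACE INTO words (word, stem_id) VALUES (?, ?)",
            (word, stem_id))
    conn.commit()
```

Calling load_words with a database connection and an open word-list file (for example one of the files under /usr/share/dict/) would then fill both tables in a single pass.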
>> If you wanted to make it more portable, you could build out a text
>> file with the word and its stem on the same line, with a tab
>> delimiter or something between them. I know there are a few lists
>> like that already on the Snowball site, and they're extremely
>> useful for testing (when I made a PHP port of Porter2 it was
>> ridiculously helpful).
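The portable text-file variant might look something like this in Python. Again this is a sketch with assumptions: the function name write_word_stem_tsv is invented here, and placeholder_stem is a stand-in that only strips a trailing "s", to be replaced by a real Snowball stemmer.

```python
from typing import Callable, Iterable, TextIO


def placeholder_stem(word: str) -> str:
    """Stand-in only, NOT a real stemmer: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def write_word_stem_tsv(words: Iterable[str],
                        out: TextIO,
                        stem: Callable[[str], str] = placeholder_stem) -> int:
    """Write 'word<TAB>stem' lines to out; return how many were written."""
    count = 0
    for line in words:
        word = line.strip()
        if not word:
            continue                  # ignore blank lines in the input
        out.write(f"{word}\t{stem(word)}\n")
        count += 1
    return count
```

A file in this shape can be read back by any tool that understands tab-separated text, so it does not tie the result to one particular database.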
>>
>> Robert
>>
>>
>>
>>
>> On Feb 25, 2010, at 2:09 AM , John Gage wrote:
>>
>>> In his article, "Snowball: A language for stemming algorithms",
>>> Porter states,
>>>
>>> "More flexibility however is obtained by indexing all words in a
>>> text in an unstemmed form, and keeping a separate two-column
>>> relation which connects the words to their stemmed equivalents.
>>> The relation can be denoted by R(s, w), which means that s is the
>>> stemmed form of word w. From the relation we can get, for any word
>>> w, its unique stemmed form, stem(w), and for any stem s, the set
>>> of words, words(s), that stem to s."
>>>
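To make the relation R(s, w) concrete, here is a small Python sketch of the two lookups Porter describes, stem(w) and words(s). The function name build_lookups and the sample pairs are illustrative assumptions, not anything taken from the article.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_lookups(relation: Iterable[Tuple[str, str]]
                  ) -> Tuple[Dict[str, str], Dict[str, Set[str]]]:
    """From (stem, word) pairs build stem(w) and words(s) as dictionaries."""
    stem_of: Dict[str, str] = {}
    words_of: Dict[str, Set[str]] = defaultdict(set)
    for s, w in relation:
        # Each word must have a unique stem, as the quoted passage states.
        if stem_of.setdefault(w, s) != s:
            raise ValueError(f"word {w!r} maps to two different stems")
        words_of[s].add(w)
    return stem_of, dict(words_of)


# Illustrative (stem, word) pairs only.
relation = [("connect", "connected"),
            ("connect", "connecting"),
            ("run", "running")]
stem_of, words_of = build_lookups(relation)
```

Here stem_of["connected"] gives the single stem of a word, and words_of["connect"] gives the set of all words sharing that stem.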
>>> My interest is in such a two-column relation (table) for all the
>>> words in the English language, or rather, for all the words that
>>> appear in an open-source word list available on the Internet.
>>>
>>> There are several lists available, but the one that stands out for
>>> me for the moment is Grady Ward's list.
>>>
>>> My interest is even narrower, because what I would like to do is
>>> store that relation/table as a table in a PostgreSQL database on
>>> my computer.
>>>
>>> I wonder if any of this has already been done? Shouldn't there be
>>> tables of words and their stems out there already? I have
>>> searched and not found them. I suppose "lemmas" might do
>>> something similar, but once again, I have searched and not found.
>>>
>>> On the other hand, what would it take to create such a list?
>>>
>>> Sorry to be such a novice and so helpless.
>>> _______________________________________________
>>> Snowball-discuss mailing list
>>> Snowball-discuss at lists.tartarus.org
>>> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>>
>