[Snowball-discuss] Relation/table of stems
John Gage
jsmgage at numericable.fr
Thu Feb 25 21:14:57 GMT 2010
Martin and Robert,
I am afraid that my level of ignorance exceeds your worst
expectations, so to speak.
In the first place, thank you very much for your replies. They are
completely on target. I have found the lists mentioned by Robert and,
yes, they are precisely what I want. I have downloaded the word list
mentioned by Martin, and it is exceptional.
But I am not even remotely a programmer. To be honest, I am
moderately stumped by the pseudo-code. What would be marvelous would
be an executable that I could run to get the result I need. If there
is a script language version that I could run unchanged or nearly
unchanged, that would be great.
I am truly sorry to come on the scene unprepared, but this is
something I really, really want.
Tell me how I can help myself, please.
Thanking you,
John Gage
On Feb 25, 2010, at 8:28 AM, tedivm wrote:
>
> Most Unix distributions come with a giant word list, used for spell
> checkers. The format is pretty simple: one word per line. OS X ships
> with two versions of Webster's, a list of proper names, and a list of
> stop words, all located in /usr/share/dict/.
>
> From there it should only take a small script to accomplish what you
> need. Assuming you have two tables, one storing stems and one
> storing words, the pseudo code would look like this:
>
> for each line, assign it to Word
>     Build Stem from Word
>     Insert/Replace Stem, retrieving its ID
>     Insert Word and Stem ID
>
> If you wanted to make it more portable, you could build out a text
> file with the word and its stem on the same line, with a tab
> delimiter or something between them. I know there are a few lists
> like that already on the Snowball site, and they're extremely useful
> for testing (when I made a PHP port of Porter2 they were ridiculously
> helpful).
>
> Robert
>
> On Feb 25, 2010, at 2:09 AM, John Gage wrote:
>
>> In his article, "Snowball: A language for stemming algorithms",
>> Porter states,
>>
>> "More flexibility however is obtained by indexing all words in a
>> text in an unstemmed form, and keeping a separate two-column
>> relation which connects the words to their stemmed equivalents. The
>> relation can be denoted by R(s, w), which means that s is the
>> stemmed form of word w. From the relation we can get, for any word
>> w, its unique stemmed form, stem(w), and for any stem s, the set of
>> words, words(s), that stem to s."
>>
>> My interest is in such a two-column relation (table) for all the
>> words in the English language, or rather all those that appear in
>> an open source list available on the Internet.
>>
>> There are several lists available, but the one that stands out for
>> me for the moment is Grady Ward's list.
>>
>> My interest is even narrower, because what I would like to do is
>> store that relation/table as a table in a PostgreSQL database on my
>> computer.
>>
>> I wonder if any of this has already been done? Shouldn't there be
>> tables of words and their stems out there already? I have searched
>> and not found them. I suppose "lemmas" might do something similar,
>> but once again, I have searched and not found.
>>
>> On the other hand, what would it take to create such a list?
>>
>> Sorry to be such a novice and so helpless.
>> _______________________________________________
>> Snowball-discuss mailing list
>> Snowball-discuss at lists.tartarus.org
>> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>
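
For anyone following the thread, the pseudo code quoted above could be
sketched roughly as follows. This is only an illustration under several
assumptions, not a finished tool: it uses Python's built-in sqlite3
module as a stand-in for PostgreSQL, the table and function names
(stems, words, build_relation, stem_of, words_of) are made up for the
example, words are lowercased before stemming, and toy_stem is a crude
placeholder rather than the Porter or Snowball algorithm. It also shows
how stem(w) and words(s) from Porter's relation R(s, w) fall out as
simple queries.

```python
import sqlite3


def toy_stem(word):
    """Placeholder stemmer (hypothetical): strips a few common suffixes.

    This is NOT Porter/Snowball; pass a real stemmer to build_relation
    for actual use.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_relation(conn, lines, stem=toy_stem):
    """Fill the two tables from an iterable of lines, one word per line."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stems ("
        "id INTEGER PRIMARY KEY, stem TEXT UNIQUE NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "word TEXT PRIMARY KEY, "
        "stem_id INTEGER NOT NULL REFERENCES stems(id))"
    )
    for line in lines:
        word = line.strip().lower()  # assumption: case is not significant
        if not word:
            continue
        s = stem(word)  # "Build Stem"
        # "Insert/Replace Stem, retrieving ID"
        conn.execute("INSERT OR IGNORE INTO stems (stem) VALUES (?)", (s,))
        stem_id = conn.execute(
            "SELECT id FROM stems WHERE stem = ?", (s,)
        ).fetchone()[0]
        # "insert Word and Stem id"
        conn.execute(
            "INSERT OR REPLACE INTO words (word, stem_id) VALUES (?, ?)",
            (word, stem_id),
        )
    conn.commit()


def stem_of(conn, word):
    """stem(w): the unique stem stored for a word, or None if unknown."""
    row = conn.execute(
        "SELECT s.stem FROM words w JOIN stems s ON s.id = w.stem_id "
        "WHERE w.word = ?",
        (word,),
    ).fetchone()
    return row[0] if row else None


def words_of(conn, s):
    """words(s): all stored words that stem to s, in sorted order."""
    rows = conn.execute(
        "SELECT w.word FROM words w JOIN stems s ON s.id = w.stem_id "
        "WHERE s.stem = ? ORDER BY w.word",
        (s,),
    )
    return [r[0] for r in rows]


# Small in-memory demonstration with a hand-picked word list.
conn = sqlite3.connect(":memory:")
build_relation(conn, ["connect", "connected", "connecting", "connects", "table"])
print(stem_of(conn, "connected"))  # connect
print(words_of(conn, "connect"))
```

To run it over a real list, an open file can be passed in place of the
hand-picked words, since a file yields one line at a time (for example a
file under /usr/share/dict/, where the exact file name depends on the
system). If the third-party snowballstemmer package is installed, its
stemmer("english").stemWord method could be passed as the stem
argument in place of toy_stem. The SQL would need small changes for
PostgreSQL, since "INSERT OR IGNORE" and "INSERT OR REPLACE" are SQLite
syntax.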