[Snowball-discuss] Relation/table of stems

Trevor Strohman strohman at cs.umass.edu
Fri Feb 26 04:48:24 GMT 2010


I just built the table, and I'll send it to John directly.

On Thu, Feb 25, 2010 at 7:44 PM, John Gage <jsmgage at gmail.com> wrote:
> Robert,
>
> Now I will be guilty of answering a question with a question, but given that
> I tend to agree with you, may I ask what the project is?  That is to say,
> given the amount of code that is in the project, is the only purpose of the
> project to write more code?  But if that were true, wouldn't there be a part
> of the project that wanted that code to be used outside the confines of the
> people doing the coding?
>
> John
>
>
>
> On Thu, Feb 25, 2010 at 10:52 PM, tedivm <tedivm at tedivm.com> wrote:
>>
>> John,
>> Chances are you're not going to find the direct kind of help I think you
>> need on this list. Most project mailing lists are focused on the project
>> itself, and since they depend on volunteers there are not a lot of
>> resources to accommodate requests for new projects. I'm not in charge of,
>> or even on, this project, other than using it and loving it, so I
>> could be wrong on this account (although I don't think I am).
>> If you want to continue this discussion off list I'd be happy to help you
>> flesh out your ideas enough so that a programmer could work on it, as well
>> as direct you to some programmers who may be able to take the job on.
>> Robert
>>
>>
>> On Feb 25, 2010, at 4:42 PM , John Gage wrote:
>>
>> Robert,
>>
>> This is a fair question.  I want to be able to use a unique word as a
>> primary key into a table to obtain the word's stem and then, using the stem,
>> obtain all the words in the table derived from the same stem...a kind of
>> orthographic thesaurus.  You may ask why I want an orthographic thesaurus,
>> and my response would be that I am still in the formative stages of
>> understanding that question myself.
>>
>> Once again, I apologize for not quite being up to speed.
>>
>> John
>>
>>
>> On Thu, Feb 25, 2010 at 10:31 PM, tedivm <tedivm at tedivm.com> wrote:
>>>
>>> John,
>>> If you're not a programmer, what are you using this list of words for? I'm
>>> just trying to understand your ultimate goal with this.
>>> Robert
>>>
>>>
>>>
>>> On Feb 25, 2010, at 4:14 PM , John Gage wrote:
>>>
>>> Martin and Robert,
>>> I am afraid that my level of ignorance exceeds your worst expectations,
>>> so to speak.
>>> In the first place, thank you very much for your replies.  They are
>>> completely on target.  I have found the lists mentioned by Robert and, yes,
>>> they are precisely what I want.  I have downloaded the word list mentioned
>>> by Martin, and it is exceptional.
>>> But I am not even remotely a programmer.  To be honest, I am moderately
>>> stumped by the pseudo-code.  What would be marvelous would be an executable
>>> that I could run to get the result I need.  If there is a script language
>>> version that I could run unchanged or nearly unchanged, that would be great.
>>> I am truly sorry to come on the scene unprepared, but this is something I
>>> really, really want.
>>> Tell me how I can help myself, please.
>>> Thanking you,
>>> John Gage
>>>
>>>
>>> On Feb 25, 2010, at 8:28 AM, tedivm wrote:
>>>
>>> Most Unix distros come with a giant word list, used for spell
>>> checkers. The format is pretty simple: one word per line. OS X ships with two
>>> versions of Webster's, a list of proper names, and a list of stop words, all
>>> located in /usr/share/dict/
>>> From there it should only take a small script to accomplish what you
>>> need. Assuming you have two tables, one storing stems and one storing words,
>>> the pseudocode would look like this:
>>> for each line, assign to Word:
>>>     build Stem from Word
>>>     insert/replace Stem, retrieving its ID
>>>     insert Word and the Stem ID
>>> If you wanted to make it more portable, you could build out a text file
>>> with the word and its stem on the same line, with a tab delimiter or
>>> something between them. I know there are a few lists like that already on
>>> the Snowball site, and they're extremely useful for testing (when I made a
>>> PHP port of Porter2 it was ridiculously helpful).
>>> Robert
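The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the script anyone on the list actually ran: it uses the standard-library sqlite3 module as a stand-in for PostgreSQL, the table and column names (stems, words, stem_id) are made up for the example, and toy_stem is a crude placeholder that is NOT the Porter or Snowball algorithm. A real run would pass in an actual Snowball stemmer instead.

```python
import sqlite3
from typing import Callable, Iterable


def toy_stem(word: str) -> str:
    """Placeholder stemmer for illustration only; NOT Porter/Snowball."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_tables(conn: sqlite3.Connection,
                 lines: Iterable[str],
                 stem: Callable[[str], str]) -> None:
    """Fill a stems table and a words table from a one-word-per-line source."""
    # Hypothetical schema: one row per distinct stem, one row per word.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stems ("
        "id INTEGER PRIMARY KEY, stem TEXT UNIQUE NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "word TEXT PRIMARY KEY, "
        "stem_id INTEGER NOT NULL REFERENCES stems(id))"
    )
    for line in lines:
        word = line.strip()          # for each line, assign to Word
        if not word:
            continue                 # skip blank lines
        s = stem(word)               # build Stem
        # Insert the stem if it is new, then retrieve its ID.
        conn.execute("INSERT OR IGNORE INTO stems (stem) VALUES (?)", (s,))
        (stem_id,) = conn.execute(
            "SELECT id FROM stems WHERE stem = ?", (s,)
        ).fetchone()
        # Insert Word and the Stem ID.
        conn.execute(
            "INSERT OR REPLACE INTO words (word, stem_id) VALUES (?, ?)",
            (word, stem_id),
        )
    conn.commit()
```

To try it against a system word list, something like `build_tables(sqlite3.connect("stems.db"), open("/usr/share/dict/words"), toy_stem)` should work on machines where that file exists; the database filename is arbitrary.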
>>>
>>>
>>>
>>> On Feb 25, 2010, at 2:09 AM , John Gage wrote:
>>>
>>> In his article, "Snowball: A language for stemming algorithms", Porter
>>> states,
>>>
>>> "More flexibility however is obtained by indexing all words in a text in
>>> an unstemmed form, and keeping a separate two-column relation which connects
>>> the words to their stemmed equivalents. The relation can be denoted by R(s,
>>> w), which means that s is the stemmed form of word w. From the relation we
>>> can get, for any word w, its unique stemmed form, stem(w), and for any stem
>>> s, the set of words, words(s), that stem to s."
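The relation R(s, w) in the passage quoted above can be pictured with two in-memory lookups, one for stem(w) and one for words(s). The sketch below is only an illustration: the names build_relation, stem_of and words_of are invented for the example, and toy_stem is a crude placeholder, NOT the Porter or Snowball stemmer.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set, Tuple


def toy_stem(word: str) -> str:
    """Placeholder stemmer for illustration only; NOT Porter/Snowball."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_relation(
    words: Iterable[str], stem: Callable[[str], str]
) -> Tuple[Dict[str, str], Dict[str, Set[str]]]:
    """Return (stem_of, words_of), the two directions of R(s, w)."""
    stem_of: Dict[str, str] = {}                      # w -> stem(w)
    words_of: Dict[str, Set[str]] = defaultdict(set)  # s -> words(s)
    for w in words:
        s = stem(w)
        stem_of[w] = s
        words_of[s].add(w)
    return stem_of, dict(words_of)


stem_of, words_of = build_relation(
    ["connect", "connected", "connecting", "relation"], toy_stem
)
print(stem_of["connected"])         # prints: connect
print(sorted(words_of["connect"]))  # prints: ['connect', 'connected', 'connecting']
```

Each word maps to exactly one stem, while each stem maps to a set of words, which is why the first lookup is a plain dictionary and the second holds sets.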
>>>
>>> My interest is in such a two-column relation (table) for all the words in
>>> the English language, or at least all the words that are available in an
>>> open source list on the Internet.
>>>
>>> There are several lists available, but the one that stands out for me for
>>> the moment is Grady Ward's list.
>>>
>>> My interest is even narrower, because what I would like to do is store
>>> that relation/table as a table in a PostgreSQL database on my computer.
>>>
>>> I wonder if any of this has already been done?  Shouldn't there be tables
>>> of words and their stems out there already?  I have searched and not found
>>> them.  I suppose "lemmas" might do something similar, but once again, I have
>>> searched and not found.
>>>
>>> On the other hand, what would it take to create such a list?
>>>
>>> Sorry to be such a novice and so helpless.
>>> _______________________________________________
>>> Snowball-discuss mailing list
>>> Snowball-discuss at lists.tartarus.org
>>> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
>
>


