[Snowball-discuss] Relation/table of stems

Thu Feb 25 21:52:10 GMT 2010

John,

Chances are you're not going to find the kind of direct help I think you need on this list. Most project mailing lists are focused on the project itself, and since they depend on volunteers there are few resources to accommodate requests for new projects. I'm not in charge of this project, or even involved in it beyond using it and loving it, so I could be wrong about this (although I don't think I am).

If you want to continue this discussion off-list, I'd be happy to help you flesh out your ideas enough that a programmer could work on them, and to direct you to some programmers who may be able to take the job on.

Robert

On Feb 25, 2010, at 4:42 PM , John Gage wrote:

> Robert,
> 
> This is a fair question.  I want to be able to use a unique word as a primary key into a table to obtain the word's stem and then, using the stem, obtain all the words in the table derived from the same stem...a kind of orthographic thesaurus.  You may ask why I want an orthographic thesaurus, and my response would be that I am still in the formative stages of understanding that question myself.
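As a rough illustration of the lookup described above (word to stem, then stem to every word sharing it), here is a minimal sketch using Python's built-in sqlite3 module rather than PostgreSQL, so that it runs without a database server. The table name `relation`, the function `related_words`, and the sample rows are all made up for the example; real rows would come from running a stemmer over a word list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per word; the word itself is the primary key, as described above.
conn.execute("CREATE TABLE relation (word TEXT PRIMARY KEY, stem TEXT NOT NULL)")
conn.execute("CREATE INDEX relation_stem ON relation (stem)")
# Hand-written sample rows (assumption), not the output of a real stemmer run.
conn.executemany(
    "INSERT INTO relation (word, stem) VALUES (?, ?)",
    [
        ("connect", "connect"),
        ("connected", "connect"),
        ("connection", "connect"),
        ("table", "tabl"),
    ],
)


def related_words(conn, word):
    """Return every word that shares a stem with `word` (self-join on stem)."""
    rows = conn.execute(
        "SELECT b.word FROM relation AS a "
        "JOIN relation AS b ON b.stem = a.stem "
        "WHERE a.word = ? ORDER BY b.word",
        (word,),
    )
    return [row[0] for row in rows]


print(related_words(conn, "connected"))
```

A word that is not in the table simply yields an empty list, so the caller does not need a separate existence check.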
> 
> Once again, I apologize for not quite being up to speed.
> 
> John
> 
> 
> On Thu, Feb 25, 2010 at 10:31 PM, tedivm <tedivm at tedivm.com> wrote:
> John,
> 
> If you're not a programmer, what are you using this list of words for? I'm just trying to understand your ultimate goal with this.
> 
> Robert
> 
> 
> 
> 
> On Feb 25, 2010, at 4:14 PM , John Gage wrote:
> 
>> Martin and Robert,
>> 
>> I am afraid that my level of ignorance exceeds your worst expectations, so to speak.
>> 
>> In the first place, thank you very much for your replies.  They are completely on target.  I have found the lists mentioned by Robert and, yes, they are precisely what I want.  I have downloaded the word list mentioned by Martin, and it is exceptional.
>> 
>> But I am not even remotely a programmer.  To be honest, I am moderately stumped by the pseudo-code.  What would be marvelous would be an executable that I could run to get the result I need.  If there is a scripting-language version that I could run unchanged or nearly unchanged, that would be great.
>> 
>> I am truly sorry to come on the scene unprepared, but this is something I really, really want.
>> 
>> Tell me how I can help myself, please.
>> 
>> Thanking you,
>> 
>> John Gage
>> 
>> 
>> 
>> On Feb 25, 2010, at 8:28 AM, tedivm wrote:
>> 
>>> 
>>> Most Unix distros come with a giant word list, used by spell checkers. The format is pretty simple: one word per line. OS X ships with two versions of Webster's, a list of proper names, and a list of stop words, all located in /usr/share/dict/
>>> 
>>> From there it should only take a small script to accomplish what you need. Assuming you have two tables, one storing stems and one storing words, the pseudocode would look like this:
>>> 
>>> for each line assign to Word
>>> 	Build Stem
>>> 	Insert/Replace Stem, retrieving ID
>>> 	insert Word and Stem id
>>> 	
>>> If you wanted to make it more portable, you could build a text file with each word and its stem on the same line, separated by a tab or some other delimiter. I know there are a few lists like that already on the Snowball site, and they're extremely useful for testing (when I made a PHP port of Porter2 they were ridiculously helpful).
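The pseudocode above could be sketched in Python roughly as follows. This is only an illustration under several assumptions: it uses the built-in sqlite3 module instead of PostgreSQL so that it runs without a server, the table and column names are invented, and `toy_stem` is a placeholder that strips a few suffixes and is not the Porter algorithm. A real stemmer function would be passed in its place.

```python
import sqlite3


def toy_stem(word):
    """Hypothetical placeholder, NOT the Porter algorithm: strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def load_words(conn, lines, stem=toy_stem):
    """Fill a stems table and a words table from one-word-per-line input."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stems (id INTEGER PRIMARY KEY, stem TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words "
        "(word TEXT PRIMARY KEY, stem_id INTEGER REFERENCES stems(id))"
    )
    for line in lines:                      # for each line assign to Word
        word = line.strip().lower()         # lowercasing is a choice made for this sketch
        if not word:
            continue
        s = stem(word)                      # Build Stem
        # Insert the stem if it is new, then retrieve its ID either way.
        conn.execute("INSERT OR IGNORE INTO stems (stem) VALUES (?)", (s,))
        stem_id = conn.execute(
            "SELECT id FROM stems WHERE stem = ?", (s,)
        ).fetchone()[0]
        # Insert Word and Stem ID.
        conn.execute(
            "INSERT OR REPLACE INTO words (word, stem_id) VALUES (?, ?)",
            (word, stem_id),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_words(conn, ["walk\n", "walked\n", "walking\n", "table\n"])
print(conn.execute("SELECT COUNT(*) FROM stems").fetchone()[0])
print(conn.execute("SELECT COUNT(*) FROM words").fetchone()[0])
```

Because `load_words` accepts any iterable of lines, an open file handle for one of the word lists in /usr/share/dict/ could be passed directly, assuming such a file exists on the machine in question.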
>>> 
>>> Robert
>>> 
>>> 
>>> 
>>> 
>>> On Feb 25, 2010, at 2:09 AM , John Gage wrote:
>>> 
>>>> In his article, "Snowball: A language for stemming algorithms", Porter states,
>>>> 
>>>> "More flexibility however is obtained by indexing all words in a text in an unstemmed form, and keeping a separate two-column relation which connects the words to their stemmed equivalents. The relation can be denoted by R(s, w), which means that s is the stemmed form of word w. From the relation we can get, for any word w, its unique stemmed form, stem(w), and for any stem s, the set of words, words(s), that stem to s."
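For illustration, the relation R(s, w) in the passage quoted above can be modelled in memory as a pair of lookups, one for each direction. This is a sketch only: `toy_stem` is a made-up stand-in that strips a few suffixes and is not a Snowball stemmer, and the names `build_relation`, `stem_of` and `words_of` are hypothetical.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_relation(words, stem=toy_stem):
    """Return (stem_of, words_of), the two directions of R(s, w)."""
    stem_of = {}                     # w -> stem(w), unique for each word
    words_of = defaultdict(set)      # s -> words(s), the set of words stemming to s
    for w in words:
        s = stem(w)
        stem_of[w] = s
        words_of[s].add(w)
    return stem_of, dict(words_of)


stem_of, words_of = build_relation(
    ["connect", "connected", "connecting", "connects", "table"]
)
print(stem_of["connected"])
print(sorted(words_of[stem_of["connected"]]))
```

Keeping both dictionaries means each direction of the relation is a single lookup, at the cost of storing every word twice.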
>>>> 
>>>> My interest is in such a two-column relation (table) for all the words in the English language, or rather all of those that appear in an open-source word list available on the Internet.
>>>> 
>>>> There are several lists available, but the one that stands out for me for the moment is Grady Ward's list.
>>>> 
>>>> My interest is even narrower, because what I would like to do is store that relation/table as a table in a PostgreSQL database on my computer.
>>>> 
>>>> I wonder if any of this has already been done?  Shouldn't there be tables of words and their stems out there already?  I have searched and not found them.  I suppose "lemmas" might do something similar, but once again, I have searched and not found.
>>>> 
>>>> On the other hand, what would it take to create such a list?
>>>> 
>>>> Sorry to be such a novice and so helpless.
>>>> _______________________________________________
>>>> Snowball-discuss mailing list
>>>> Snowball-discuss at lists.tartarus.org
>>>> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>>> 
>> 
> 
> 
> 
> 
