[Snowball-discuss] Relation/table of stems

John Gage jsmgage at gmail.com
Fri Feb 26 03:44:56 GMT 2010


Robert,

Now I will be guilty of answering a question with a question, but given that
I tend to agree with you, may I ask what the project is?  That is to say,
given the amount of code that is in the project, is the only purpose of the
project to write more code?  But if that were true, wouldn't there be a part
of the project that wanted that code to be used outside the confines of the
people doing the coding?

John



On Thu, Feb 25, 2010 at 10:52 PM, tedivm <tedivm at tedivm.com> wrote:

> John,
>
> Chances are you're not going to find the kind of direct help I think you
> need on this list. Most project mailing lists are focused on the project
> itself, and since they depend on volunteers there are not a lot of
> resources to accommodate requests for new projects. I'm not in charge of
> this project or even part of it, other than using it and loving it, so I
> could be wrong on this account (although I don't think I am).
>
> If you want to continue this discussion off list I'd be happy to help you
> flesh out your ideas enough so that a programmer could work on it, as well
> as direct you to some programmers who may be able to take the job on.
>
> Robert
>
>
>
> On Feb 25, 2010, at 4:42 PM, John Gage wrote:
>
> Robert,
>
> This is a fair question.  I want to be able to use a unique word as a
> primary key into a table to obtain the word's stem and then, using the stem,
> obtain all the words in the table derived from the same stem...a kind of
> orthographic thesaurus.  You may ask why I want an orthographic thesaurus,
> and my response would be that I am still in the formative stages of
> understanding that question myself.
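
The lookup described here (word in, stem out, then every word sharing that stem) might be sketched as follows. This is only an illustration under stated assumptions: it uses SQLite through Python's standard sqlite3 module so that it runs without a database server, the table name word_stem and the helper same_stem are hypothetical, and the word/stem pairs are typed in by hand rather than produced by a stemmer. The SELECT itself is plain SQL and should carry over to other relational databases with little change.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# One row per word; the word itself is the primary key.
conn.execute("CREATE TABLE word_stem (word TEXT PRIMARY KEY, stem TEXT NOT NULL)")
# Index on stem so the "all words for this stem" step does not scan the table.
conn.execute("CREATE INDEX word_stem_by_stem ON word_stem (stem)")

# Hand-typed pairs for illustration only (assumption, not stemmer output).
conn.executemany(
    "INSERT INTO word_stem (word, stem) VALUES (?, ?)",
    [
        ("connect", "connect"),
        ("connected", "connect"),
        ("connecting", "connect"),
        ("connection", "connect"),
        ("connections", "connect"),
        ("stem", "stem"),
        ("stems", "stem"),
    ],
)


def same_stem(conn, word):
    """Return every word that shares a stem with `word`, sorted.

    The table is joined to itself: row `a` finds the stem of the given
    word, row `b` collects all words carrying that stem.
    """
    rows = conn.execute(
        "SELECT b.word FROM word_stem AS a "
        "JOIN word_stem AS b ON b.stem = a.stem "
        "WHERE a.word = ? ORDER BY b.word",
        (word,),
    ).fetchall()
    return [row[0] for row in rows]


print(same_stem(conn, "connection"))
```

A word that is not in the table simply yields an empty list, which seems a reasonable behaviour for a thesaurus-style lookup.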
>
> Once again, I apologize for not quite being up to speed.
>
> John
>
>
> On Thu, Feb 25, 2010 at 10:31 PM, tedivm <tedivm at tedivm.com> wrote:
>
>> John,
>>
>> If you're not a programmer what are you using this list of words for? I'm
>> just trying to understand your ultimate goal with this.
>>
>> Robert
>>
>>
>>
>>
>> On Feb 25, 2010, at 4:14 PM, John Gage wrote:
>>
>> Martin and Robert,
>>
>> I am afraid that my level of ignorance exceeds your worst expectations, so
>> to speak.
>>
>> In the first place, thank you very much for your replies.  They are
>> completely on target.  I have found the lists mentioned by Robert and, yes,
>> they are precisely what I want.  I have downloaded the word list mentioned
>> by Martin, and it is exceptional.
>>
>> But I am not even remotely a programmer.  To be honest, I am moderately
>> stumped by the pseudo-code.  What would be marvelous would be an executable
>> that I could run to get the result I need.  If there is a script language
>> version that I could run unchanged or nearly unchanged, that would be great.
>>
>> I am truly sorry to come on the scene unprepared, but this is something I
>> really, really want.
>>
>> Tell me how I can help myself, please.
>>
>> Thanking you,
>>
>> John Gage
>>
>>
>>
>> On Feb 25, 2010, at 8:28 AM, tedivm wrote:
>>
>>
>> Most Unix distros come with a giant word list, used for spell
>> checkers. The format is pretty simple, one word per line. OS X ships with two
>> versions of Webster's, a list of proper names, and a list of stop words, all
>> located in /usr/share/dict/
>>
>> From there it should only take a small script to accomplish what you
>> need. Assuming you have two tables, one storing stems and one storing words,
>> the pseudocode would look like this:
>>
>> for each line, assign to Word
>>     build Stem from Word
>>     insert/replace Stem, retrieving its ID
>>     insert Word with the Stem ID
>>
>> If you wanted to make it more portable you could build out a text file
>> with the word and its stem on the same line, with a tab delimiter or
>> something between them. I know there are a few lists like that already on
>> the Snowball site, and they're extremely useful for testing (when I made a
>> PHP port of Porter2 it was ridiculously helpful).
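
For readers who want something closer to runnable than the pseudocode, here is a minimal sketch of the two-table approach. Several things in it are assumptions rather than anything stated in the thread: it uses SQLite via Python's standard sqlite3 module instead of a server database, the names toy_stem and build_tables are hypothetical, and toy_stem is a crude suffix stripper that only stands in for a real stemmer. It is not the Porter or Snowball algorithm.

```python
import sqlite3


def toy_stem(word):
    """Placeholder stemmer that strips a few common suffixes.

    NOT the Porter/Snowball algorithm; it exists only so the sketch
    runs without third-party packages.
    """
    for suffix in ("ions", "ion", "ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_tables(conn, lines, stem=toy_stem):
    """Fill a stems table and a words table from an iterable of lines."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stems ("
        "id INTEGER PRIMARY KEY, stem TEXT UNIQUE NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "word TEXT PRIMARY KEY, "
        "stem_id INTEGER NOT NULL REFERENCES stems(id))"
    )
    for line in lines:                      # for each line, assign to Word
        word = line.strip().lower()
        if not word:
            continue                        # skip blank lines
        s = stem(word)                      # build Stem
        # Insert the stem only if it is new, then retrieve its ID.
        conn.execute("INSERT OR IGNORE INTO stems (stem) VALUES (?)", (s,))
        (stem_id,) = conn.execute(
            "SELECT id FROM stems WHERE stem = ?", (s,)
        ).fetchone()
        # Insert Word together with the Stem ID.
        conn.execute(
            "INSERT OR REPLACE INTO words (word, stem_id) VALUES (?, ?)",
            (word, stem_id),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
build_tables(conn, ["connect", "connected", "connecting", "connection", "cats"])
```

Because build_tables accepts any iterable of lines, an open file object for a one-word-per-line list could be passed in place of the small list used here, and a real stemmer could be supplied through the stem argument.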
>>
>> Robert
>>
>>
>>
>>
>> On Feb 25, 2010, at 2:09 AM, John Gage wrote:
>>
>> In his article, "Snowball: A language for stemming algorithms"
>> <http://snowball.tartarus.org/texts/introduction.html>, Porter states,
>>
>> "*More flexibility however is obtained by indexing all words in a text in
>> an unstemmed form, and keeping a separate two-column relation which connects
>> the words to their stemmed equivalents. The relation can be denoted by R(s,
>> w), which means that s is the stemmed form of word w. From the relation we
>> can get, for any word w, its unique stemmed form, stem(w), and for any stem
>> s, the set of words, words(s), that stem to s.*"
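
The relation in the quotation can be pictured with two lookups built from a list of (stem, word) pairs. The sketch below is only an illustration: the names R, stem_of, words_of and related are chosen here to mirror the quotation's notation, and the pairs are hand-picked examples, not the output of any particular stemmer.

```python
from collections import defaultdict

# R(s, w): s is the stemmed form of word w.
# Hand-picked illustrative pairs (assumption, not stemmer output).
R = [
    ("connect", "connect"),
    ("connect", "connected"),
    ("connect", "connecting"),
    ("connect", "connection"),
    ("connect", "connections"),
    ("stem", "stem"),
    ("stem", "stems"),
    ("stem", "stemming"),
]

stem_of = {}                    # stem(w): word -> its unique stem
words_of = defaultdict(set)     # words(s): stem -> set of words

for s, w in R:
    stem_of[w] = s
    words_of[s].add(w)


def related(word):
    """Return the other words sharing this word's stem (empty if unknown)."""
    s = stem_of.get(word)
    if s is None:
        return set()
    return words_of[s] - {word}


print(stem_of["connected"])
print(sorted(related("stems")))
```

Each word maps to exactly one stem, while each stem maps to a set of words, which is why a plain dictionary suffices in one direction and a dictionary of sets is used in the other.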
>>
>> My interest is in such a two-column relation (table) for all the words in
>> the English language, or at least all those that appear in an open source
>> word list available on the Internet.
>>
>> There are several lists available, but the one that stands out for me for
>> the moment is Grady Ward's list <http://icon.shef.ac.uk/Moby/mwords.html>.
>>
>> My interest is even narrower, because what I would like to do is store
>> that relation/table as a table in a PostgreSQL database on my computer.
>>
>> I wonder if any of this has already been done?  Shouldn't there be tables
>> of words and their stems out there already?  I have searched and not found
>> them.  I suppose "lemmas" might do something similar, but once again, I have
>> searched and not found.
>>
>> On the other hand, what would it take to create such a list?
>>
>> Sorry to be such a novice and so helpless.
>> _______________________________________________
>> Snowball-discuss mailing list
>> Snowball-discuss at lists.tartarus.org
>> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>
>