[Snowball-discuss] Relation/table of stems

John Gage jsmgage at gmail.com
Sat Feb 27 10:40:14 GMT 2010


I just wanted to say thank you to tedivm, Martin Porter, and Trevor for your
understanding and help.  The files sent to me by Trevor are exactly what I
wanted.  Precisely.

Thank you all again,

John Gage

On Fri, Feb 26, 2010 at 5:48 AM, Trevor Strohman <strohman at cs.umass.edu> wrote:

> I just built the table, and I'll send it to John directly.
>
> On Thu, Feb 25, 2010 at 7:44 PM, John Gage <jsmgage at gmail.com> wrote:
> > Robert,
> >
> > Now I will be guilty of answering a question with a question, but given
> > that I tend to agree with you, may I ask what the project is?  That is to
> > say, given the amount of code that is in the project, is the only purpose
> > of the project to write more code?  But if that were true, wouldn't there
> > be a part of the project that wanted that code to be used outside the
> > confines of the people doing the coding?
> >
> > John
> >
> >
> >
> > On Thu, Feb 25, 2010 at 10:52 PM, tedivm <tedivm at tedivm.com> wrote:
> >>
> >> John,
> >> Chances are you're not going to find the direct kind of help I think you
> >> need on this list. The reason being that most project mailing lists are
> >> focused on the project itself, and since they depend on volunteers there
> >> are not a lot of resources to accommodate requests for new projects. I'm
> >> not in charge or even on this project, other than using it and loving it,
> >> so I could be wrong on this account (although I don't think I am).
> >> If you want to continue this discussion off list I'd be happy to help you
> >> flesh out your ideas enough so that a programmer could work on it, as
> >> well as direct you to some programmers who may be able to take the job on.
> >> Robert
> >>
> >>
> >> On Feb 25, 2010, at 4:42 PM , John Gage wrote:
> >>
> >> Robert,
> >>
> >> This is a fair question.  I want to be able to use a unique word as a
> >> primary key into a table to obtain the word's stem and then, using the
> >> stem, obtain all the words in the table derived from the same stem...a
> >> kind of orthographic thesaurus.  You may ask why I want an orthographic
> >> thesaurus, and my response would be that I am still in the formative
> >> stages of understanding that question myself.
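The lookup described above (word to stem, then stem to every word that
shares it) can be sketched as a single table keyed on the word. This is only
an illustration under stated assumptions: it uses Python's built-in sqlite3
module as a stand-in for PostgreSQL, the table name word_stems and both
helper functions are hypothetical, and the sample rows are typed in by hand
rather than produced by a Snowball stemmer.

```python
import sqlite3

# Hand-typed sample rows for illustration only; a real table would be
# filled from a word list run through an actual stemmer.
SAMPLE_ROWS = [
    ("connect", "connect"),
    ("connected", "connect"),
    ("connection", "connect"),
    ("walk", "walk"),
    ("walking", "walk"),
]


def make_demo_db(rows=SAMPLE_ROWS):
    """Create an in-memory database with one (word, stem) row per word."""
    conn = sqlite3.connect(":memory:")
    # The word is the primary key, so each word maps to exactly one stem.
    conn.execute(
        "CREATE TABLE word_stems (word TEXT PRIMARY KEY, stem TEXT NOT NULL)"
    )
    # An index on stem keeps the reverse lookup (stem -> words) fast.
    conn.execute("CREATE INDEX word_stems_by_stem ON word_stems (stem)")
    conn.executemany("INSERT INTO word_stems (word, stem) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def related_words(conn, word):
    """Return every word sharing a stem with `word`, or [] if it is unknown."""
    rows = conn.execute(
        """
        SELECT other.word
        FROM word_stems AS mine
        JOIN word_stems AS other ON other.stem = mine.stem
        WHERE mine.word = ?
        ORDER BY other.word
        """,
        (word,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    demo = make_demo_db()
    print(related_words(demo, "walking"))
```

The self-join does both steps in one query: the row for the given word
supplies the stem, and the join collects every row with that stem. The SQL
itself is plain enough that it should carry over to PostgreSQL with little
change, though that has not been tried here.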
> >>
> >> Once again, I apologize for not quite being up to speed.
> >>
> >> John
> >>
> >>
> >> On Thu, Feb 25, 2010 at 10:31 PM, tedivm <tedivm at tedivm.com> wrote:
> >>>
> >>> John,
> >>> If you're not a programmer what are you using this list of words for?
> >>> I'm just trying to understand your ultimate goal with this.
> >>> Robert
> >>>
> >>>
> >>>
> >>> On Feb 25, 2010, at 4:14 PM , John Gage wrote:
> >>>
> >>> Martin and Robert,
> >>> I am afraid that my level of ignorance exceeds your worst expectations,
> >>> so to speak.
> >>> In the first place, thank you very much for your replies.  They are
> >>> completely on target.  I have found the lists mentioned by Robert and,
> >>> yes, they are precisely what I want.  I have downloaded the word list
> >>> mentioned by Martin, and it is exceptional.
> >>> But I am not even remotely a programmer.  To be honest, I am moderately
> >>> stumped by the pseudo-code.  What would be marvelous would be an
> >>> executable that I could run to get the result I need.  If there is a
> >>> script language version that I could run unchanged or nearly unchanged,
> >>> that would be great.
> >>> I am truly sorry to come on the scene unprepared, but this is something
> >>> I really, really want.
> >>> Tell me how I can help myself, please.
> >>> Thanking you,
> >>> John Gage
> >>>
> >>>
> >>> On Feb 25, 2010, at 8:28 AM, tedivm wrote:
> >>>
> >>> Most unix distros come with a giant word list, used for spell
> >>> checkers. The format is pretty simple, one word per line. OSX ships
> >>> with two versions of Websters, a list of proper names, and a list of
> >>> stop words all located in /usr/share/dict/
> >>> From there it should only take a small script to accomplish what you
> >>> need. Assuming you have two tables, one storing stems and one storing
> >>> words, the pseudo code would look like this:
> >>> for each line assign to Word
> >>>     Build Stem
> >>>     Insert/Replace Stem, retrieving ID
> >>>     insert Word and Stem id
> >>> If you wanted to make it more portable you could build out a text file
> >>> with the word and its stem on the same line, with a tab delimiter or
> >>> something between them. I know there are a few lists like that already
> >>> on the Snowball site, and they're extremely useful for testing (when I
> >>> made a PHP port of Porter2 it was ridiculously helpful).
> >>> Robert
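The pseudo code above could be fleshed out roughly as follows. This is a
minimal sketch, not the script anyone on the list actually ran: it uses
Python's built-in sqlite3 module in place of PostgreSQL, the table and
function names are hypothetical, and toy_stem is a deliberately crude
placeholder rather than the Porter or Porter2 algorithm. A real stemmer
would be passed in as the stem argument instead.

```python
import sqlite3


def toy_stem(word):
    """Placeholder only: strips a few suffixes. NOT a Snowball stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_tables(conn, lines, stem):
    """Fill a stems table and a words table from one-word-per-line input."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stems "
        "(id INTEGER PRIMARY KEY, stem TEXT UNIQUE NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words "
        "(word TEXT PRIMARY KEY, stem_id INTEGER NOT NULL REFERENCES stems(id))"
    )
    for line in lines:                      # for each line assign to Word
        word = line.strip().lower()
        if not word:
            continue                        # skip blank lines
        s = stem(word)                      # Build Stem
        # Insert/Replace Stem, retrieving ID. "INSERT OR IGNORE" is SQLite
        # syntax; other databases spell the upsert differently.
        conn.execute("INSERT OR IGNORE INTO stems (stem) VALUES (?)", (s,))
        (stem_id,) = conn.execute(
            "SELECT id FROM stems WHERE stem = ?", (s,)
        ).fetchone()
        # insert Word and Stem id
        conn.execute(
            "INSERT OR REPLACE INTO words (word, stem_id) VALUES (?, ?)",
            (word, stem_id),
        )
    conn.commit()


def dump_tsv(conn):
    """Return 'word<TAB>stem' lines, the portable text form of the tables."""
    rows = conn.execute(
        "SELECT w.word, s.stem FROM words AS w "
        "JOIN stems AS s ON s.id = w.stem_id ORDER BY w.word"
    )
    return [f"{word}\t{stem}" for word, stem in rows]


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    build_tables(db, ["walk\n", "walked\n", "walking\n", "table\n"], toy_stem)
    print("\n".join(dump_tsv(db)))
```

To run it over a system word list, the lines argument could simply be an
open file such as one of those under /usr/share/dict/, since a file object
yields one line at a time. Lower-casing every word is a choice made here
for simplicity; it merges proper names with common words.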
> >>>
> >>>
> >>>
> >>> On Feb 25, 2010, at 2:09 AM , John Gage wrote:
> >>>
> >>> In his article, "Snowball: A language for stemming algorithms", Porter
> >>> states,
> >>>
> >>> "More flexibility however is obtained by indexing all words in a text
> >>> in an unstemmed form, and keeping a separate two-column relation which
> >>> connects the words to their stemmed equivalents. The relation can be
> >>> denoted by R(s, w), which means that s is the stemmed form of word w.
> >>> From the relation we can get, for any word w, its unique stemmed form,
> >>> stem(w), and for any stem s, the set of words, words(s), that stem to
> >>> s."
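The relation R(s, w) in the passage above can be pictured as two lookups
built in a single pass over a word list: one from each word to its stem,
and one from each stem to its set of words. The sketch below is an
assumption-laden illustration in Python. The names build_relation and
relation_rows are made up for this example, and toy_stem is a crude
placeholder, not the Porter algorithm.

```python
from collections import defaultdict


def toy_stem(word):
    """Placeholder only: strips a few suffixes. NOT a Snowball stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_relation(words, stem=toy_stem):
    """Return (stem_of, words_of) for the relation R(s, w).

    stem_of[w]  plays the role of stem(w):  the unique stem of word w.
    words_of[s] plays the role of words(s): all words that stem to s.
    """
    stem_of = {}
    words_of = defaultdict(set)
    for w in words:
        s = stem(w)
        stem_of[w] = s
        words_of[s].add(w)
    return stem_of, dict(words_of)


def relation_rows(stem_of):
    """List the relation as sorted (s, w) pairs, one row per word."""
    return sorted((s, w) for w, s in stem_of.items())


if __name__ == "__main__":
    stem_of, words_of = build_relation(["walk", "walked", "walking", "sing"])
    print(stem_of["walking"])
    print(sorted(words_of["walk"]))
    for s, w in relation_rows(stem_of):
        print(s, w, sep="\t")
```

Each word appears in exactly one row, which is why stem(w) is unique, while
a stem may appear in many rows, which is why words(s) is a set.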
> >>>
> >>> My interest is in such a two-column relation (table) for all the words
> >>> in the English language, all the words in the English language, that
> >>> is, that are available in an open source list available on the Internet.
> >>>
> >>> There are several lists available, but the one that stands out for me
> >>> for the moment is Grady Ward's list.
> >>>
> >>> My interest is even narrower, because what I would like to do is store
> >>> that relation/table as a table in a PostgreSQL database on my computer.
> >>>
> >>> I wonder if any of this has already been done?  Shouldn't there be
> >>> tables of words and their stems out there already?  I have searched and
> >>> not found them.  I suppose "lemmas" might do something similar, but once
> >>> again, I have searched and not found.
> >>>
> >>> On the other hand, what would it take to create such a list?
> >>>
> >>> Sorry to be such a novice and so helpless.
> >>> _______________________________________________
> >>> Snowball-discuss mailing list
> >>> Snowball-discuss at lists.tartarus.org
> >>> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
> >>>
> >>
> >>
> >
> >
>