[Snowball-discuss] Re: Request for Advice
Martin Porter
martin_porter@softhome.net
Mon Jun 23 09:46:02 2003
Lemma Lessa,
I am glad to hear you are making progress.
The Porter stemmer, as written, takes its input from the list of files on
the command line, and sends its output to stdout. These ideas will be
familiar to you if you are using Unix or Linux or Windows NT, but you may
need to adapt the program slightly for other operating systems. What
computer are you developing your work on?
So if you compile the program to a module called STEM, you can run it by typing
STEM source.txt >stemmed.txt
(>stemmed.txt redirects stdout to the file stemmed.txt)
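To make the file handling concrete, here is a minimal sketch in ANSI-style C of a
program body that reads each named file and writes the result to an output stream.
It is not the code of the distributed Porter stemmer: the names stem_word,
flush_word, stem_stream and stem_files are made up for this illustration, and
stem_word is only a placeholder that lowercases the word.

```c
#include <ctype.h>
#include <stdio.h>

/* Placeholder for the real stemming step (an assumption of this sketch):
   it only lowercases the word in place. It is NOT the Porter algorithm. */
void stem_word(char *word)
{
    for (; *word != '\0'; word++)
        *word = (char)tolower((unsigned char)*word);
}

/* Write the pending word, if any, through stem_word() and reset the buffer. */
void flush_word(char *word, size_t *len, FILE *out)
{
    if (*len > 0) {
        word[*len] = '\0';
        stem_word(word);
        fputs(word, out);
        *len = 0;
    }
}

/* Copy `in` to `out`, passing each run of letters through stem_word() and
   leaving every other character unchanged. Runs longer than the buffer are
   processed in pieces. */
void stem_stream(FILE *in, FILE *out)
{
    char word[256];
    size_t len = 0;
    int ch;

    while ((ch = getc(in)) != EOF) {
        if (isalpha(ch)) {
            if (len == sizeof word - 1)
                flush_word(word, &len, out);
            word[len++] = (char)ch;
        } else {
            flush_word(word, &len, out);
            putc(ch, out);
        }
    }
    flush_word(word, &len, out);
}

/* Process every named file in turn. Returns 0 on success, or 1 if any file
   could not be opened (the situation behind a "File Not Found" message). */
int stem_files(int count, char *names[], FILE *out)
{
    int status = 0;
    int i;

    for (i = 0; i < count; i++) {
        FILE *in = fopen(names[i], "r");
        if (in == NULL) {
            fprintf(stderr, "cannot open %s\n", names[i]);
            status = 1;
            continue;
        }
        stem_stream(in, out);
        fclose(in);
    }
    return status;
}
```

A main function that simply returns stem_files(argc - 1, argv + 1, stdout) would
give the command-line behaviour described above. A file name given without a path
is looked up in the directory the program is run from, so source.txt should be
in that directory.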
Reading through your note, it seems that the problem you are having is
encoding the stemming algorithm proper, which, although you describe it as
non-functional, is in fact a set of rules, very much like the Porter
stemmer, from which the algorithm might be coded up in a functional way. The
ANSI C version of the Porter stemmer might be a useful model to follow, but
that is only one of several approaches.
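As an illustration of how a set of rules can be turned into code, here is a
minimal sketch in C of table-driven suffix stripping, applied repeatedly until no
rule fires. Everything specific in it is an assumption made for the example: the
suffixes are English-looking placeholders and not the Wolaytta rules, MIN_STEM is
an invented condition, and the names rules, apply_rules and strip_suffixes are
hypothetical. The ANSI C Porter stemmer is organised differently.

```c
#include <string.h>

/* One rule: if the word ends in `suffix`, replace that ending with
   `replacement`. Each replacement must be shorter than its suffix, so the
   edit can be done in place and the loop below always terminates. */
struct rule {
    const char *suffix;
    const char *replacement;
};

/* Hypothetical rules, longest suffix first so the longest match wins. */
static const struct rule rules[] = {
    { "ations", "ate" },
    { "ness",   ""    },
    { "ing",    ""    },
    { "s",      ""    }
};

/* Hypothetical condition: at least this many letters must remain before
   the suffix for a rule to apply. */
#define MIN_STEM 3

/* Apply the first matching rule to `word` in place.
   Returns 1 if the word was changed, 0 otherwise. */
int apply_rules(char *word)
{
    size_t len = strlen(word);
    size_t i;

    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t slen = strlen(rules[i].suffix);
        if (slen <= len && len - slen >= MIN_STEM &&
            strcmp(word + len - slen, rules[i].suffix) == 0) {
            strcpy(word + len - slen, rules[i].replacement);
            return 1;
        }
    }
    return 0;
}

/* Iterative stripping: keep applying rules until none matches. */
void strip_suffixes(char *word)
{
    while (apply_rules(word))
        ;
}
```

With these placeholder rules, "walking" becomes "walk" and "relations" becomes
"relate", while "sing" is left alone because too little would remain before the
suffix. A real stemmer would replace the table and the condition with the
language's own rules.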
The impression I have is that you are a bit short of technical assistance in
this work, and it would be a pity if it got stuck when you have put so much
effort into the actual algorithm. Have you seen the stemming algorithm rules
at snowball.tartarus.org? For example, see the pages
http://snowball.tartarus.org/french/stemmer.html
http://snowball.tartarus.org/russian/stemmer.html
If you could provide the algorithm in this exact form, plus a sample
vocabulary, I might be able to help you by developing a Snowball stemmer,
and could then send you an ANSI C module that did what you required. You
should however talk this idea over with your research supervisor.
I hope you will not mind if I post this email on snowball-discuss - other
replies can often be useful. I have edited out the sections that describe
your stemming rules, since this is your own research work.
At 23:19 22/06/2003 -0700, lemma lessa wrote:
>
>Dear sir,
>
>I am a student at Addis Ababa University, Ethiopia. I am doing research on
developing a stemming algorithm for one of the local languages. My stemmer
follows an iterative approach. I used ANSI C to code my algorithm. I have
decided to adopt the Porter stemmer. But I faced the problems listed below
and came to you hoping that you will help me. I prefer to present my
questions as follows:
>
>I have found the ANSI C version of the Porter algorithm, but when I run it,
it responds 'File Not Found'. Assuming that the name of the file to be
stemmed is "source.txt", where shall I save this file so that the program can
find it?
>
>After stemming the file, how does it save the stem dictionary? (By what name
and where?)
>
>How my stemmer works: My stemmer reads information from three files: a suffix
file, a stopword file and a source file. It has three modules: one takes care
of all matters of the suffix file, the second one deals with stopwords, and the
third one deals with the actual suffix stripping task. It works in such a
way that it first reads an unstemmed word from the source file; then it reads
entries from the stopword file and compares them with the word read from the
source file. If the word exists in the stopword list, the program reads the
next word from the source file. Otherwise, the suffix file is opened and
suffixes are stripped, if any. Finally, conditions are checked against the
resulting stem and the necessary action is taken, if applicable. (Please
see the conditions/actions below.)
>
>Problem faced: The third module (the one that deals with the suffix
stripping) is not functional. Please, help me !!!!! as to how to adopt the
Porter stemmer based on the conditions/actions given below. Assume the name
of the suffix file is "Suffix.txt" and the stopword file is "stopword.txt".
>
>
>
>The only Conditions/rules considered by the stemmer
>
>The stemmer developed for stemming Wolaytta text is a context-sensitive one.
This decision was made mainly to get better performance from the
stemmer. There are two context-sensitive actions employed in the stemmer.
These are:
>
[Lemma Lessa's stemming rules follow]
>Please, when this message arrives at your desk, let me know that it has
been received.
>
>For your favorable reply, I remain.
>
>
>
>Yours,
>
>Lemma