[Snowball-discuss] Romanian stemmer

Martin Porter martin.porter at grapeshot.co.uk
Wed Sep 6 12:39:16 BST 2006


To the originators of the Romanian stemmers,

I have now found time to do some preliminary work on the Romanian stemmer. I
should explain that part of the complication has been the receipt, no more
than ten days apart, of two Romanian stemmers in Snowball, the first
(romanian1) from Stegarescu and others in Heidelberg, the second (romanian2)
from Tirdea in Bucharest (see the original emails below).

For the time being both stemmers are in place at

http://snowball.tartarus.org/algorithms/romanian1/stemmer.html
http://snowball.tartarus.org/algorithms/romanian2/stemmer.html

These pages are not currently linked to, and not all links from these pages
work, but the following should be looked at:

http://snowball.tartarus.org/algorithms/romanian1/diffs.txt

(with the character encoding set to Unicode of course). I have put together a
vocabulary by combining the vocabularies provided with romanian1 and
romanian2. This appears in column 1. Column 2 is the stemmed form produced by
romanian1, and column 3 the stemmed form produced by romanian2. If the entry
in column 3 is blank, both stemmers are producing the same result.
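
For anyone wanting to pull out just the disagreements, here is a minimal
Python sketch of how the three columns might be read. It assumes the columns
are separated by whitespace, which may not match the actual layout of
diffs.txt, and the function names are only illustrative.

```python
def parse_diff_line(line):
    """Split one row into (word, stem1, stem2).

    Assumption: columns are whitespace-separated. A missing third column
    means romanian2 gives the same stem as romanian1.
    """
    fields = line.split()
    if len(fields) < 2:
        return None  # blank or malformed row
    word, stem1 = fields[0], fields[1]
    stem2 = fields[2] if len(fields) > 2 else stem1
    return word, stem1, stem2


def disagreements(lines):
    """Return only the rows where the two stemmers produce different stems."""
    rows = (parse_diff_line(line) for line in lines)
    return [row for row in rows if row is not None and row[1] != row[2]]
```

Called on the lines of the file (opened with encoding="utf-8"),
disagreements() would leave only the entries worth comparing by eye.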

You might care to compare the two approaches.

My own feeling is that romanian1 does a more thorough job of ending removal,
but unlike romanian2 has a habit of discarding too much from short words.
aberant->ab, abatere->ab, aburi->ab are examples of this. In romanian1 the R2
test is rarely used (it seems to me that 'R1 or R2' is equivalent to 'R1',
since p2 is never to the left of p1).
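
To illustrate the point about p1 and p2, here is a small Python sketch using
the usual Snowball definitions: R1 starts after the first non-vowel following
a vowel, and R2 is found the same way inside R1. The vowel set is an
assumption, and romanian1 may mark its regions differently, so this is only
an illustration.

```python
VOWELS = set("aeiouăâî")  # assumed Romanian vowel set


def region_start(word, start=0):
    """Index just past the first non-vowel that follows a vowel,
    scanning from `start`; len(word) if there is no such position."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word):
    """Return (p1, p2), the start positions of R1 and R2."""
    p1 = region_start(word)
    p2 = region_start(word, p1)  # the scan starts at p1, so p2 >= p1
    return p1, p2


for w in ("aberant", "abatere", "aburi"):
    p1, p2 = r1_r2(w)
    print(w, "R1 =", w[p1:], "R2 =", w[p2:])
```

Because p2 is found by scanning onwards from p1, anything in R2 is also in
R1. Under these definitions all three example words have p1 = 2, so an ending
removed anywhere in R1 can leave just "ab".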

I might have a go at making some modifications here. Needless to say, I am
not familiar with Romanian, but the similarity to the other Romance
languages, especially Italian, enables one to grasp the essential features of
the morphology.

What we would like, if possible, is to have a single stemmer for release from
the Snowball site, with all necessary credits given, along the lines of the
recent addition,

http://snowball.tartarus.org/algorithms/hungarian/stemmer.html

Hope to hear from you,

Martin Porter



18 Jul 2006

Dear Mr. Porter, dear Mr. Boulton,

Department of Computational Linguistics,
Ruprecht-Karls-University of Heidelberg

Marina Stegarescu: mstegare at hotmail.com
Doina Gliga: doina_gliga at yahoo.co.uk
Erwin Glockner: eglockner at hotmail.com


we have finally finished the Romanian stemmer. Unfortunately, the evaluation
took more time than expected.
However, it was an interesting experience creating the stemmer, and we
are happy to send you the result of our work.
The attachment is a compressed tarball with (hopefully) all the files
needed. Both the files and the stemmer itself are encoded in UTF-8. Please
let us know if anything is missing.

We would be happy if the Romanian stemmer were accepted and
integrated into the official Snowball distribution. We agree of course
to license the stemmer under the same terms as the existing Snowball
software.

We're looking forward to hearing from you soon.


With kind regards,

Marina, Doina and Erwin.


28 Jul 2006

Irina Tirdea: irina.tirdea at gmail.com
Hello,

My name is Irina Tirdea and I have developed a Romanian stemmer in Snowball
as part of my bachelor thesis, in Bucharest, Romania. I am sending you the
code attached (with vocabulary and stop word list files) and I hope you will
accept and integrate it as a part of the Snowball project. I am ready to
release the stemmer under the BSD license, just as the Snowball software.
The files have been written in UTF-8 encoding (on a Linux system).

Looking forward to hearing from you.

Kind regards,
Irina Tirdea




