[Snowball-discuss] RE: Bug report?
Fri Oct 10 09:52:01 2003
It is a good idea to copy stemming ideas to
email@example.com, which is where all the lively
discussions take place!
I apologise for the confusion over the three forms of the English stemmer.
(It could be worse though.) The separation of the work across two different web
address areas does not help. I will try to come up with some wording that
explains it more clearly.
In particular, the fact that "s" stems to null in the Porter stemmer with
its 1980 definition ought to be mentioned on its web page.
I'll add this when I'm less busy with other work.
I agree with you about not stemming to the null string. The point with
Russian is that many of the endings are also stopwords that one might wish
to eliminate from an indexing process anyway. The first Russian stemmer I
did (more than 10 years ago now) took that approach.
At 15:33 10/10/2003 +0900, Alexander Gelbukh wrote:
>Thank you for your answer! The confusion arose because your page does discuss
>an "original" versus an "improved" version, but does not seem to indicate very
>clearly which one is which and where each one is. Perhaps you'd consider
>making it VERY clear on your page for dummies like me.
>As to the discussion of Russian empty stems, it did not convince me. I
>wonder what they meant specifically: what are examples of Russian words with
>an empty stem? I know only one very arguable group of such words ("vynut'",
>"perenyat'", "zanyat'", ...) with an arguably empty stem or the stem "-n-" (I
>guess historically there was a stem -n- (-im-?) followed by a suffix -n-,
>which then contracted together to one -n-). I cannot think of any other
>linguistically valid example.
>But even with this, I think it is a better choice not to allow empty stems
>by definition. Two arguments for this:
>- Technical: altering the file format (two columns --> one column in some
>rows) or the word count in a file can lead to subtle errors that are difficult
>to detect, as happened in my case.
>- Pragmatic: The very purpose of a stemmer is to map "the same" words into
>one symbol but "different" ones into different symbols. This is prone to
>both types of errors: false alarms and misses. Mapping words to an empty
>stem can hardly decrease the miss rate but will probably dramatically
>increase the false alarm rate. If this is indeed done for one group of
>words, perhaps it's wiser to map them into something else, say, into one of
>them: "vynut'", "perenyat'", "zanyat'", ... --> "vynut'". Or to leave them
>alone, as you don't stem "be, are, is, was, were" into a common (empty?!)
>stem but just leave them alone.
>Thank you again for your attention!
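
The technical argument in the quoted message (a two-column word/stem file
losing a column in some rows) can be illustrated with a small sketch. It is
only an illustration under stated assumptions: toy_stem is a hypothetical
stand-in that strips one trailing "s" and is NOT the Porter algorithm, and
safe_stem is a hypothetical wrapper showing one possible guard, namely
leaving the word alone when its stem would be empty.

```python
def toy_stem(word):
    # Hypothetical stand-in for a stemmer, NOT the Porter algorithm:
    # it strips one trailing "s", so the word "s" becomes the empty string.
    return word[:-1] if word.endswith("s") else word


def safe_stem(word, stem=toy_stem):
    # Hypothetical guard: fall back to the original word whenever
    # the underlying stemmer would return an empty stem.
    result = stem(word)
    return result if result else word


words = ["cats", "s", "dog"]

# Naive "word stem" lines, read back by splitting on whitespace.
# The line for "s" is "s " and splits into a single field.
naive_rows = [f"{w} {toy_stem(w)}".split() for w in words]
naive_widths = [len(row) for row in naive_rows]

# Guarded lines: every row keeps both fields.
safe_rows = [f"{w} {safe_stem(w)}".split() for w in words]
safe_widths = [len(row) for row in safe_rows]

print(naive_widths)
print(safe_widths)
```

With the naive version, the row for "s" comes back with one field instead of
two, which is the kind of silent format change described above. The guarded
version keeps two fields in every row.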