Re[2]: [Snowball-discuss] an inconsistency with Russian stemmer

Andrew Aksyonoff <shodan@chat.ru>
Fri, 16 Nov 2001 16:04:17 +0300


Hello Martin!

Friday, November 16, 2001, 12:50:45 PM, you wrote:
MP> Andrew, I'm posting this to the discussion list and copying the email
MP> to you.
Thanks for the fast reply. I'll do the same.

MP> There is of course no need to implement the stemmer in C, since a
MP> stemmer in C, generated from the Snowball script, is provided on the
MP> site. Were you not aware of that, or did you prefer to develop one
MP> yourself?
I was, but I decided to implement my own: the algorithm is very
simple, so it seemed easier than even linking with the provided
sources. There were other reasons too (to get acquainted with
stemming algorithms, to try improving this one, to test
implementations for speed, etc.).

MP> In any case, I cannot see how to fix your problem from the information
MP> you have sent me. You would need to send me the sources of the program
MP> you have developed.
Certainly. Here it goes (in KOI-8; if this is unreadable, I'll
send you a version where all Russian letters are replaced
with hex numbers):

--- cut ---
#include <stdio.h>
#include <string.h>

typedef struct {
        unsigned char suffix[8];  /* suffixes are single-byte KOI-8, so 8 bytes suffice */
        int           remove;
} stem_table;

static unsigned char stem_ru_letters[] = "абвгдежзийклмнопрстуфхцчшщъыьэюя";
static unsigned char stem_ru_vowels[] = "аеиоуыэюя";

static stem_table ru_gerund[] = {
        { "ав",     1 },
        { "авши",   3 },
        { "авшись", 5 },

        { "яв",     1 },
        { "явши",   3 },
        { "явшись", 5 },

        { "ив",     2 },
        { "ивши",   4 },
        { "ившись", 6 },

        { "ыв",     2 },
        { "ывши",   4 },
        { "ывшись", 6 },
};

static stem_table ru_adj[] = {
        { "ее",  2 },
        { "ие",  2 },
        { "ые",  2 },
        { "ое",  2 },
        { "ими", 3 },
        { "ыми", 3 },
        { "ей",  2 },
        { "ий",  2 },
        { "ый",  2 },
        { "ой",  2 },
        { "ем",  2 },
        { "им",  2 },
        { "ым",  2 },
        { "ом",  2 },
        { "его", 3 },
        { "ого", 3 },
        { "ему", 3 },
        { "ому", 3 },
        { "их",  2 },
        { "ых",  2 },
        { "ую",  2 },
        { "юю",  2 },
        { "ая",  2 },
        { "яя",  2 },
        { "ою",  2 },
        { "ею",  2 }
};

static stem_table ru_part[] = {
        { "аем", 2 },
        { "анн", 2 },
        { "авш", 2 },
        { "ающ", 2 },
        { "ащ",  1 },
        { "яем", 2 },
        { "янн", 2 },
        { "явш", 2 },
        { "яющ", 2 },
        { "ящ",  1 },
        { "ивш", 3 },
        { "ывш", 3 },
        { "ующ", 3 }
};

static stem_table ru_reflex[] = {
        { "ся", 2 },
        { "сь", 2 }
};

static stem_table ru_verb[] = {
        { "ала",  2 },
        { "ана",  2 },
        { "аете", 3 },
        { "айте", 3 },
        { "али",  2 },
        { "ай",   1 },
        { "ал",   1 },
        { "аем",  2 },
        { "ан",   1 },
        { "ало",  2 },
        { "ано",  2 },
        { "ает",  2 },
        { "ают",  2 },
        { "аны",  2 },
        { "ать",  2 },
        { "аешь", 3 },
        { "анно", 3 },
        { "яла",  2 },
        { "яна",  2 },
        { "яете", 3 },
        { "яйте", 3 },
        { "яли",  2 },
        { "яй",   1 },
        { "ял",   1 },
        { "яем",  2 },
        { "ян",   1 },
        { "яло",  2 },
        { "яно",  2 },
        { "яет",  2 },
        { "яют",  2 },
        { "яны",  2 },
        { "ять",  2 },
        { "яешь", 3 },
        { "янно", 3 },

        { "ила",  3 },
        { "ыла",  3 },
        { "ена",  3 },
        { "ейте", 4 },
        { "уйте", 4 },
        { "ите",  3 },
        { "или",  3 },
        { "ыли",  3 },
        { "ей",   2 },
        { "уй",   2 },
        { "ил",   2 },
        { "ыл",   2 },
        { "им",   2 },
        { "ым",   2 },
        { "ен",   2 },
        { "ило",  3 },
        { "ыло",  3 },
        { "ено",  3 },
        { "ят",   2 },
        { "ует",  3 },
        { "уют",  3 },
        { "ит",   2 },
        { "ыт",   2 },
        { "ены",  3 },
        { "ить",  3 },
        { "ыть",  3 },
        { "ишь",  3 },
        { "ую",   2 },
        { "ю",    1 }
};

static stem_table ru_noun[] = {
        { "а",    1 },
        { "ев",   2 },
        { "ов",   2 },
        { "ие",   2 },
        { "ье",   2 },
        { "е",    1 },
        { "иями", 4 },
        { "ями",  3 },
        { "ами",  3 },
        { "еи",   2 },
        { "ии",   2 },
        { "и",    1 },
        { "ией",  3 },
        { "ей",   2 },
        { "ой",   2 },
        { "ий",   2 },
        { "иям",  3 },
        { "ям",   2 },
        { "ием",  3 },
        { "ам",   2 },
        { "ом",   2 },
        { "о",    1 },
        { "у",    1 },
        { "ах",   2 },
        { "иях",  3 },
        { "ях",   2 },
        { "ы",    1 },
        { "ь",    1 },
        { "ию",   2 },
        { "ью",   2 },
        { "ю",    1 },
        { "ия",   2 },
        { "ья",   2 },
        { "я",    1 }
};

static stem_table ru_super[] = {
        { "ейш",  3 },
        { "ейше", 4 }
};

static stem_table ru_deriv[] = {
        { "ост",  3 },
        { "ость", 4 }
};

int stem_ru_iv(unsigned char l)
{
        register unsigned char *v = stem_ru_vowels;

        while (*v && *v != l) v++;
        return (*v == l) ? 1 : 0;
}

int stem_ru_table(unsigned char *word, int *len, stem_table *table, int ntable)
{
        int i, j, k;

        for (i = 0; i < ntable; i++) {
                j = strlen((char *) table[i].suffix)-1; // FIXME!!!
                k = (*len)-1;
                if (j > k) continue;
                for (; j >= 0; k--, j--)
                        if (word[k] != table[i].suffix[j]) break;
                if (j >= 0) continue;

                *len -= table[i].remove;
                return 1;
        }
        return 0;
}

#define STEM_RU_FUNC(func,table) \
        int func(unsigned char *word, int *len) \
        { \
                return stem_ru_table(word, len, \
                        table, sizeof(table) / sizeof(stem_table)); \
        }

STEM_RU_FUNC(stem_ru_gerund, ru_gerund)
STEM_RU_FUNC(stem_ru_adj,    ru_adj)
STEM_RU_FUNC(stem_ru_part,   ru_part)
STEM_RU_FUNC(stem_ru_reflex, ru_reflex)
STEM_RU_FUNC(stem_ru_verb,   ru_verb)
STEM_RU_FUNC(stem_ru_noun,   ru_noun)
STEM_RU_FUNC(stem_ru_super,  ru_super)
STEM_RU_FUNC(stem_ru_deriv,  ru_deriv)

int stem_ru_adjectival(unsigned char *word, int *len)
{
        if (stem_ru_adj(word, len)) {
                stem_ru_part(word, len);
                return 1;
        }
        return 0;
}

int stem_ru_verbal(unsigned char *word, int *len)
{
        if (stem_ru_reflex(word, len)) {
                if (stem_ru_verb(word, len)) return 1;
                if (stem_ru_adjectival(word, len)) return 1;
                if (stem_ru_noun(word, len)) return 1;
                return 1;
        }
        return stem_ru_verb(word, len);
}

void stem_ru(unsigned char *word)
{
        int end, rv, r1, r2;
        int i, len;

        /* find RV (position after the first vowel) and the R1/R2 regions */
        len = strlen((char *) word);
        rv = r1 = r2 = len;
        for (i = 0; i < len; i++)
                if (stem_ru_iv(word[i])) { rv = i+1; break; }
        if (rv == len) return;

        for (i = 0; i < len-1; i++)
                if (stem_ru_iv(word[i]) && !stem_ru_iv(word[i+1])) { r1 = i+2; break; }
        for (i = r1; i < len-1; i++)
                if (stem_ru_iv(word[i]) && !stem_ru_iv(word[i+1])) { r2 = i+2; break; }

        word += rv;
        len -= rv;
        r1 -= rv;
        r2 -= rv;

        /* step 1: gerund, adjectival, verbal, noun -- first match wins */
        while (1) {
                if (stem_ru_gerund(word, &len)) break;
                if (stem_ru_adjectival(word, &len)) break;
                if (stem_ru_verbal(word, &len)) break;
                if (stem_ru_noun(word, &len)) break;
                break;
        }

        /* step 2 */
        if (len > 0 && (word[len-1] == 'й' || word[len-1] == 'и')) len--;

        /* step 3: derivational ending, searched within R2 only */
        len -= r2;
        stem_ru_deriv(word+r2, &len);
        len += r2;

        /* step 4: superlative, double "н", trailing soft sign */
        stem_ru_super(word, &len);
        if (len > 1 && word[len-2] == 'н' && word[len-1] == 'н') len--;
        if (word[len-1] == 'ь') len--;

        word[len] = 0;
}

int main()
{
        unsigned char buf[256];

        while (fgets((char *) buf, sizeof(buf), stdin)) {
                if (buf[strlen((char *) buf)-1] == '\n')
                        buf[strlen((char *) buf)-1] = 0;
                stem_ru(buf);
                printf("%s\n", buf);
        }
        return 0;
}
--- cut ---

Note: I do in-place stemming, so the stem_ru_*() functions return 1
if any stemming took place and 0 otherwise, and the actual "stemming"
is done simply by adjusting the "len" variable.
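
For example, a single call only shrinks "len"; the caller truncates
the buffer afterwards, just as stem_ru() does. A minimal sketch (the
word here is a hypothetical test case, not from my test data):

unsigned char w[] = "машинами";   /* "mashinami", single-byte KOI-8 */
int len = strlen((char *) w);

if (stem_ru_noun(w, &len))        /* noun ending "ами" matches, len -= 3 */
        w[len] = 0;               /* w now holds "машин" ("mashin") */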

The first problem is as follows.

My implementation (shorthand: MY) is

int stem_ru_verbal(unsigned char *word, int *len)
{
        if (stem_ru_reflex(word, len)) {
                if (stem_ru_verb(word, len)) return 1;
                if (stem_ru_adjectival(word, len)) return 1;
                if (stem_ru_noun(word, len)) return 1;
                return 1;
        }
        return stem_ru_verb(word, len);
}

while I believe that both the explanation and the Snowball
source mean something like this (shorthand: ORIG):

int stem_ru_verbal(unsigned char *word, int *len)
{
        int save = *len;

        if (stem_ru_reflex(word, len)) {
                if (stem_ru_verb(word, len)) return 1;
                if (stem_ru_adjectival(word, len)) return 1;
                *len = save; // this undoes stem_ru_reflex(word, len)
        }
        return stem_ru_verb(word, len);
}

The difference can be seen in the following examples:

"avos'": MY gives "av", ORIG gives "avos"
"bereglas'": MY gives "beregl", ORIG gives "bereglas"

The second problem (it seems that not only a trailing "i" but also
"i'" should be stemmed in step 2) can be seen on:

"zmei'": MY gives "zme", ORIG gives "zmei'"
"znai'": MY gives "zna", ORIG gives "znai'"

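In code terms, the step 2 difference is just this; the first line is
what stem_ru() above actually does, and the second is how I read the
published step 2, so treat that reading as my assumption:

/* MY: drop a trailing "й" ("i'") as well as "и" ("i") */
if (len > 0 && (word[len-1] == 'й' || word[len-1] == 'и')) len--;

/* ORIG: drop a trailing "и" ("i") only */
if (len > 0 && word[len-1] == 'и') len--;
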
In general, the problems are as follows:

1) the MY implementation gives results that match output.txt
   perfectly (which it should not);
2) the ORIG implementation in turn does not (which it should);
3) performing the algorithm by hand, I get the very same
   results as with ORIG.

Thus, I'd be very grateful if you would show me the sequence
of stemming actions one should take according to the currently
published algorithm to reduce "zmei'" to "zme" and "bereglas'"
to "beregl" - I'll be able to find my error from that.

A simple example of such a sequence:

word: vazhnei'shimi
Step 1: remove adjectival "imi"
Step 2: do nothing
Step 3: do nothing
Step 4: remove superlative "ei'sh"

The sequence I keep getting for "bereglas'":

word: bereglas'
Step 1: remove noun ending "'"
Step 2: do nothing
Step 3: do nothing
Step 4: do nothing

The trouble here is that "gla", which precedes the reflexive "s'"
and thus should be a verb or adjectival ending, does not fit any
of them. Thus, "s'" is treated as the noun ending "'".
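
For comparison, here is the path MY takes on the same word inside
stem_ru_verbal() (my own hand trace against the code above, not
program output):

word: bereglas'
  reflexive "s'" removed                 -> "beregla"
  verb endings: none fit ("la" would need "a"/"ya" before it)
  adjectival endings: none fit
  noun ending "a" removed                -> "beregl"

That is, MY keeps the reflexive removal even though nothing verbal
follows it, which is exactly where it diverges from ORIG.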

I hope my explanation of what's going on is clear enough.

MP> P.S. Not related to the great Vassili Aksyonov, I suppose?
No, this surname is widespread enough in Russia.

- Andrew


_______________________________________________
Snowball-discuss mailing list
Snowball-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/snowball-discuss
