[Snowball-discuss] Slovene stemmer

Martin Porter martin.porter at grapeshot.co.uk
Tue Apr 19 14:56:55 BST 2005


Bostan,

I have, after a long absence, come back to the Snowball work and have been
looking at your stemmer. As promised, I have rewritten it to make proper use
of amongs. Here is the result, very much smaller and very much faster:


integers (
    p1
)

groupings (
    samoglasniki
    crke
    soglasniki
)

stringescapes {}

/* special characters (in ISO-8859-2) */

stringdef sv   hex 'B9'  // s-hacek
stringdef cv   hex 'E8'  // c-hacek
stringdef zv   hex 'BE'  // z-hacek

define crke         'abc{cv}defghijklmnoprs{sv}tuvz{zv}'
define samoglasniki 'aeiou'
define soglasniki   crke - samoglasniki

externals (
    stem
)

define stem as (
    $p1 = limit

    backwards  (
        do loop 4 (
            try ($p1>8
                [substring] among ('ovski' 'evski' 'anski' (delete))
            )
            try ($p1>7
                [substring] among ('stvo' '{sv}tvo' (delete))
            )
            $p1 = size
            try ($p1>6
                [substring] among (
                    '{sv}en' 'ski' '{cv}ek' 'ovm' 'ega' 'ovi' 'ijo' 'ija'
                    'ema' 'ste' 'ejo' 'ite' 'ila' '{sv}{cv}e' '{sv}ki'
                    'ost' 'ast' 'len' 'ven' 'vna' '{cv}an' 'iti' (delete))
            )
            $p1 = size
            try ($p1>6
                [substring] among (
                    'al' 'ih' 'iv' 'eg' 'ja' 'je' 'em' 'en' 'ev' 'ov' 'jo'
                    'ma' 'mi' 'eh' 'ij' 'om' 'do' 'o{cv}' 'ti' 'il' 'ec'
                    'ka' 'in' 'an' 'at' 'ir' (delete))
            )
            $p1 = size
            try ($p1>5
                [substring] among ('{sv}' 'm' 'c' 'a' 'e' 'i' 'o' 'u'
                    (delete))
            )
            $p1 = size
            try (($p1>6)  (
                [soglasniki] test soglasniki delete
                )
            )
            $p1 = size
            try ($p1>5
                [substring] among ('a' 'e' 'i' 'o' 'u' (delete))
            )
        )
    )
)

I have also assembled a Slovene vocabulary to try it out.

Now that I can see the structure of the stemmer, I am surprised that it repeats
the suffix removal cycle four times. I notice that if I change 4 to 3, I get
a different result. I know this is not always an easy question to answer,
but can this be related to Slovene morphology in any way? The various
thresholds 8, 7, 6, etc. applied to p1 were, I assume, arrived at by
experiment. Do you think using syllable measurement (as in the other
stemmers) might improve the result?
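
For comparison, here is a minimal sketch of how the other stemmers measure by
syllable rather than by character count: they mark a region R1 that begins
after the first non-vowel following a vowel, and remove a suffix only if it
lies inside R1. The routine names mark_regions and R1 are borrowed from those
stemmers and are not part of the stemmer above. This is a fragment to be
merged into the program (where p1 and the groupings are already declared),
and it is untested against the Slovene vocabulary.

```
routines (
    mark_regions
    R1
)

// set p1 to the start of R1, or leave it at the end of the word if
// the word has no vowel followed by a non-vowel
define mark_regions as (
    $p1 = limit
    do (
        gopast samoglasniki  gopast non-samoglasniki  setmark p1
    )
)

backwardmode (
    // true while the cursor has not moved back past the start of R1
    define R1 as $p1 <= cursor
)

// hypothetical use inside the backwards section, in place of a test
// such as $p1>6 (suffix list abbreviated here):
//     try ( [substring] R1 among ('al' 'ih' 'iv' 'eg' (delete)) )
```

To use it, stem would have to call mark_regions before entering the backwards
section, and p1 could no longer be overwritten with size between steps. Under
this definition R1 in sloven is 'en', so -en could be removed but -ven could
not, since it begins before R1.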

There are a few things I must ask you about. Much of the stemming looks very
nice. For example,

telovadbe                     telovad
telovadcem                    telovad
telovadcev                    telovad
telovadi                      telovad
telovadil                     telovad
telovaditi                    telovad
telovadne                     telovad
telovadni                     telovad
telovadno                     telovad
telovnik                      telovnik
tem                           tem
tema                          tema
temacna                       tema
temacni                       tema
temacno                       tema

But I am concerned that, with the character count approach and the 'loop
4', the residual stems are very short. The following illustrates this:

sloven                        slo
slovenca                      slo
slovence                      slo
slovencem                     slo
slovencev                     slo
slovenci                      slo
slovencih                     slo
slovenec                      slo
slovenija                     slov
sloveniji                     slo
slovenijo                     slov
slovenko                      slo
slovenska                     slo
slovenske                     slo
slovenskega                   slo
slovenskem                    slo
slovenskemu                   slo
slovenski                     slov
slovenskih                    slo
slovenskim                    slo
slovenskimi                   slo
slovensko                     slo
slovenstva                    slo
slovenšcina                   slo
slovenšcini                   slo
slovenšcino                   slo
slovenščina                   slo
slovenščini                   slo
slovenščino                   slo

Would not sloven (or slov) be a more desirable stem in this case?

Another point. I notice a common -ah suffix, which you have not removed, as
in this example:

besed                         besed
beseda                        besed
besedah                       besedah           <------------
besedam                       besed
besedami                      besed
besede                        besed
besedi                        besed
besedice                      besed
besedico                      besed
besedila                      besed
besedilmiran                  besed
besedilo                      besed
besedno                       besed

Could this be added to the list of suffixes?
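
As a sketch of what that change might look like, 'ah' could join the
two-letter suffixes. Placing it in this step is an assumption; the rest of the
list is unchanged from the stemmer above.

```
$p1 = size
try ($p1>6
    [substring] among (
        'al' 'ih' 'iv' 'eg' 'ja' 'je' 'em' 'en' 'ev' 'ov' 'jo'
        'ma' 'mi' 'eh' 'ij' 'om' 'do' 'o{cv}' 'ti' 'il' 'ec'
        'ka' 'in' 'an' 'at' 'ir'
        'ah'    // proposed addition
        (delete))
)
```

Working through it by hand, besedah has 7 letters, so the test $p1>6 passes,
-ah is removed, and the remaining besed is too short for any later step to
apply.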

Martin Porter