[Snowball-discuss] Slovene stemmer
Martin Porter
martin.porter at grapeshot.co.uk
Tue Apr 19 14:56:55 BST 2005
Bostan,
I have, after a long absence, come back to the Snowball work and have been
looking at your stemmer. As promised, I have rewritten it to make proper use
of amongs. Here is the result, very much smaller and very much faster:
integers (
    p1
)
groupings (
    samoglasniki
    crke
    soglasniki
)
stringescapes {}
/* special characters (in ISO-8859-2) */
stringdef sv hex 'B9' // s-hacek
stringdef cv hex 'E8' // c-hacek
stringdef zv hex 'BE' // z-hacek
define crke 'abc{cv}defghijklmnoprs{sv}tuvz{zv}'
define samoglasniki 'aeiou'
define soglasniki crke - samoglasniki
externals (
    stem
)
define stem as (
    $p1 = limit
    backwards (
        do loop 4 (
            try ($p1>8
                [substring] among ('ovski' 'evski' 'anski' (delete))
            )
            try ($p1>7
                [substring] among ('stvo' '{sv}tvo' (delete))
            )
            $p1 = size
            try ($p1>6
                [substring] among (
                    '{sv}en' 'ski' '{cv}ek' 'ovm' 'ega' 'ovi' 'ijo' 'ija'
                    'ema' 'ste' 'ejo' 'ite' 'ila' '{sv}{cv}e' '{sv}ki'
                    'ost' 'ast' 'len' 'ven' 'vna' '{cv}an' 'iti' (delete))
            )
            $p1 = size
            try ($p1>6
                [substring] among (
                    'al' 'ih' 'iv' 'eg' 'ja' 'je' 'em' 'en' 'ev' 'ov' 'jo'
                    'ma' 'mi' 'eh' 'ij' 'om' 'do' 'o{cv}' 'ti' 'il' 'ec'
                    'ka' 'in' 'an' 'at' 'ir' (delete))
            )
            $p1 = size
            try ($p1>5
                [substring] among ('{sv}' 'm' 'c' 'a' 'e' 'i' 'o' 'u'
                    (delete))
            )
            $p1 = size
            try (($p1>6) (
                    [soglasniki] test soglasniki delete
                )
            )
            $p1 = size
            try ($p1>5
                [substring] among ('a' 'e' 'i' 'o' 'u' (delete))
            )
        )
    )
)
I have also assembled a Slovene vocabulary to try it out.
Now that I can see the structure of the stemmer, I am surprised that it
repeats the suffix removal cycle four times. I notice that if I change 4 to
3, I get a different result. I know this is not always an easy question to
answer, but can this be related to Slovene morphology in any way? The various
measures (8, 7, 6 etc.) applied to p1 were, I assume, arrived at by
experiment. Do you think using syllable measurement (as in the other
stemmers) might improve the result?
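To make the syllable-measurement idea concrete, here is a minimal, untested sketch of the R1 region marking used in several other Snowball stemmers, adapted to the samoglasniki grouping above. The routine names mark_regions, R1 and remove_suffix, and the three-suffix list, are illustrative assumptions and not part of the stemmer as posted.

```snowball
integers ( p1 )
groupings ( samoglasniki )
routines ( mark_regions R1 remove_suffix )
externals ( stem )

define samoglasniki 'aeiou'

// p1 is set just after the first non-vowel that follows a vowel,
// or is left at the end of the word if there is no such position
define mark_regions as (
    $p1 = limit
    do (
        gopast samoglasniki
        gopast non-samoglasniki
        setmark p1
    )
)

backwardmode (
    // succeeds when the cursor lies inside the region R1
    define R1 as $p1 <= cursor

    // hypothetical subset of suffixes: each is deleted only if it
    // lies wholly inside R1, instead of testing the word length
    define remove_suffix as (
        [substring] R1 among ('ija' 'ski' 'en' (delete))
    )
)

define stem as (
    do mark_regions
    backwards (
        do remove_suffix
    )
)
```

With this definition p1 would fall after slov in both sloven and slovenija, so an R1-guarded removal could not cut the stem back further than slov. Whether that boundary suits Slovene morphology in general is a separate question that would need checking against the vocabulary.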
There are a few things I must ask you about. Much of the stemming looks very
nice. For example,
telovadbe telovad
telovadcem telovad
telovadcev telovad
telovadi telovad
telovadil telovad
telovaditi telovad
telovadne telovad
telovadni telovad
telovadno telovad
telovnik telovnik
tem tem
tema tema
temacna tema
temacni tema
temacno tema
But I am concerned that, with the character count approach and the 'loop
4', the residual stems are very short. The following illustrates this:
sloven slo
slovenca slo
slovence slo
slovencem slo
slovencev slo
slovenci slo
slovencih slo
slovenec slo
slovenija slov
sloveniji slo
slovenijo slov
slovenko slo
slovenska slo
slovenske slo
slovenskega slo
slovenskem slo
slovenskemu slo
slovenski slov
slovenskih slo
slovenskim slo
slovenskimi slo
slovensko slo
slovenstva slo
slovenšcina slo
slovenšcini slo
slovenšcino slo
slovenščina slo
slovenščini slo
slovenščino slo
Would not sloven (or slov) be a more desirable stem in this case?
Another point. I notice a common -ah suffix, which you have not removed, as
for example here,
besed besed
beseda besed
besedah besedah <------------
besedam besed
besedami besed
besede besed
besedi besed
besedice besed
besedico besed
besedila besed
besedilmiran besed
besedilo besed
besedno besed
Could this be added to the list of suffixes?
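As a sketch of what that addition might look like, the two-letter step could simply gain an 'ah' entry. This is an untested assumption: it takes for granted that -ah belongs with the other two-letter endings under the same p1 > 6 guard, and it isolates that one step in a small stand-alone stemmer for illustration only.

```snowball
integers ( p1 )
externals ( stem )

stringescapes {}
stringdef cv hex 'E8' // c-hacek (ISO-8859-2), as in the stemmer above

define stem as (
    backwards (
        $p1 = size
        // the two-letter step with 'ah' added (assumed placement)
        try ($p1>6
            [substring] among (
                'al' 'ah' 'ih' 'iv' 'eg' 'ja' 'je' 'em' 'en' 'ev' 'ov' 'jo'
                'ma' 'mi' 'eh' 'ij' 'om' 'do' 'o{cv}' 'ti' 'il' 'ec'
                'ka' 'in' 'an' 'at' 'ir' (delete))
        )
    )
)
```

Read literally, besedah has seven characters, so it passes the guard and would lose its -ah to give besed, in line with the other forms listed. Shorter words ending in -ah would be left alone by the same guard.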
Martin Porter