[Snowball-discuss] West Iberian queries

Martin Porter martin.f.porter at gmail.com
Fri Nov 6 06:10:06 GMT 2020


Victor, Olly,

I am sorry not to have responded to Victor's earlier email, but I had
not seen it before. (Just why is a bit of a mystery: I am certainly on
the snowball-discuss mailing list, and it is very unlikely that it was
treated as spam, since I check my 'spam' box regularly, yet I have no
record of having received the email.)

Anyway, Olly's answers are certainly correct here. An -arte ending
might come from a verb of the -artir type (comparte from compartir,
reparte from repartir), or might not be verbal at all: baluarte, for
example, which I think means a defence work.

The stemmers were built up a few rules at a time. At each stage you
get a list like the one at
http://snowball.tartarus.org/algorithms/spanish/diffs.txt

Introducing or altering a rule gives a new list. From the two lists
you derive, using some script, a list of lines like this,

word -- stemmed form -- new stemmed form

for all the words where the rule change gives rise to a different
stemmed form. Suppose there are 100 words in the list, where the new
stemmed form is better for 60 of the words and worse for 40 of the
words. Then you could say that the rule change has led to a 60%
improvement. My own approach was only to accept a new rule or rule
change when the improvement was quite high: 90% or so. The question of
removing -te from verbal infinitives might be revisited, but a rule to
do it should be tested statistically in this way.
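
As a rough illustration of the comparison described above, here is a
minimal Python sketch. It is not the script actually used for the
Snowball stemmers. It assumes each list holds one "word stem" pair per
line, separated by whitespace; the function names and the sample stems
are made up for the example. Deciding which changes are better or
worse is still done by hand, so the script only takes those two counts
as input.

```python
def parse_stem_list(text):
    """Parse lines of the assumed form 'word stem' into a dict."""
    stems = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            stems[parts[0]] = parts[1]
    return stems


def changed_stems(old, new):
    """Return (word, old stem, new stem) for words whose stem differs."""
    return [
        (word, old[word], new[word])
        for word in sorted(old)
        if word in new and old[word] != new[word]
    ]


def improvement_rate(better, worse):
    """Fraction of changed words judged (by hand) to have improved."""
    total = better + worse
    return better / total if total else 0.0


# Made-up stems for illustration only; not real stemmer output.
old_list = """comparte compart
reparte repart
baluarte baluart"""

new_list = """comparte compar
reparte repar
baluarte baluart"""

changes = changed_stems(parse_stem_list(old_list), parse_stem_list(new_list))
for word, old_stem, new_stem in changes:
    # Same layout as the line format shown earlier in this message.
    print(f"{word} -- {old_stem} -- {new_stem}")

# The 60 better / 40 worse case from the text.
print(f"{improvement_rate(60, 40):.0%}")
```

With the sample data, only the two words whose stems differ are
listed, and the 60/40 split is reported as 60%, which would fall well
short of a 90% threshold.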

Martin


