[Snowball-discuss] West Iberian queries
    Martin Porter 
    martin.f.porter at gmail.com
       
    Fri Nov  6 06:10:06 GMT 2020
    
    
  
Victor, Olly,
I am sorry not to have responded to Victor's earlier email, but I had
not seen it before. (Just why is a bit of a mystery: I am certainly on
the snowball-discuss mailing list, and it is very unlikely that it got
treated as spam. In any case I check my 'spam' box regularly, but I
have no record of having received the email.)
Anyway, Olly's answers are certainly correct here. -arte endings might
come from a verb of the -artir type (comparte from compartir, reparte
from repartir), or might not be verbal (baluarte, for example, which I
think means a defence work).
The stemmers were built up a few rules at a time. At each stage you
get a list like the one at
http://snowball.tartarus.org/algorithms/spanish/diffs.txt
Introducing or altering a rule gives a new list. From the two lists
you derive, using some script, a list of lines like this,
word -- stemmed form -- new stemmed form
for all the words where the rule change gives rise to a different
stemmed form. Suppose there are 100 words in the list, where the new
stemmed form is better for 60 of the words and worse for 40 of the
words. Then you could say that the rule change has led to a 60%
improvement. My own approach was only to accept a new rule or rule
change when the improvement was quite high: 90% or so. The question of
removing -te from verbal infinitives might be revisited, but a rule to
do it should be tested statistically in this way.
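
As a rough illustration of the comparison described above, here is a
minimal Python sketch. It is not the script actually used for the
Snowball stemmers. The function names, the 0.9 threshold default and the
sample stems are all hypothetical, and the judgement of whether a new
stem is "better" is assumed to be supplied by hand as True/False values.

```python
def changed_stems(old, new):
    """Return (word, old_stem, new_stem) for each word whose stem differs.

    `old` and `new` are dicts mapping word -> stemmed form, produced by
    running the stemmer before and after a rule change (assumed inputs).
    """
    return [
        (word, old[word], new[word])
        for word in sorted(old)
        if word in new and old[word] != new[word]
    ]


def improvement_rate(judgements):
    """Fraction of changed words judged better under the new rule.

    `judgements` is a sequence of booleans, True where the new stem is
    better and False where it is worse.
    """
    judgements = list(judgements)
    if not judgements:
        return 0.0
    return sum(judgements) / len(judgements)


def accept_rule(judgements, threshold=0.9):
    """Accept the rule change only if the improvement rate is high enough."""
    return improvement_rate(judgements) >= threshold


if __name__ == "__main__":
    # Hypothetical stems, for illustration only.
    old = {"comparte": "compart", "reparte": "repart", "baluarte": "baluart"}
    new = {"comparte": "compar", "reparte": "repar", "baluarte": "baluart"}
    for word, before, after in changed_stems(old, new):
        print(f"{word} -- {before} -- {after}")

    # 60 better and 40 worse, as in the example above.
    votes = [True] * 60 + [False] * 40
    print(improvement_rate(votes), accept_rule(votes))
```

With 60 better and 40 worse the rate is 0.6, well short of a threshold
of about 90%, so a rule change like that would not be accepted.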
Martin
    
    