[Snowball-discuss] West Iberian queries

Olly Betts olly at survex.com
Fri Nov 6 00:24:41 GMT 2020


Hi Victor,

Apologies for nobody responding to you until now.  Martin's the best
person to answer some of your questions, and I think I had hoped he'd
respond when this first arrived.

On Sat, Aug 29, 2020 at 08:04:02PM +0200, Victor wrote:
> I am quite interested in Snowball and its applications, and have been
> playing around with it lately. However, I've noticed there is currently no
> Galician algorithm and have found no reference to it being worked on. If
> that is indeed the case, I'd like to contribute to its development. It may
> take a while as I am fairly new to this, but I'm aware of the contributing
> advice file and instructions on the Github and website, so I think for now I
> should be set. Please do let me know if there is anything else I should be
> aware of.

Nobody's working on Galician for Snowball as far as I know.

There is some existing work on stemming Galician in the academic
literature, which might be a useful place to start, e.g.:

https://link.springer.com/chapter/10.1007/3-540-45735-6_9

> Also, I was wondering why second person verb-attached pronouns are
> completely ignored in Spanish, ie. _saludarte_ incorrectly renders
> _saludart_, when it should be _salud_ as when you input _saludarme_.
> Similarly, -selo/a(s) is accounted for but not in the first or second
> persons ('melo' 'mela' 'melos' 'melas' 'telo' 'tela' 'telos' 'telas'). Is
> this due to their rarity? I would be surprised, but please let me know if
> that is the case.

Martin Porter wrote the Spanish stemmer originally, so I was hoping he
could provide insights here.

https://snowballstem.org/algorithms/spanish/stemmer.html is a
description of the algorithm, but that doesn't cover the "why" of what
it does.

And there isn't anything on the "Romance language stemmers" page to
help here either (https://snowballstem.org/algorithms/romance.html).

Typically deliberate omissions are because it's hard to come up
with a rule which handles such cases without causing overstemming in
other cases (e.g. words which end "-te" where that isn't an attached
pronoun), or because they would add a lot of complexity without
improving retrieval results (e.g. the English stemmer doesn't attempt to
conflate all the different forms of irregular verbs like "to be").
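
As a rough illustration of the overstemming concern, here is a toy
Python sketch.  It is not the Snowball Spanish stemmer: the pronoun
lists are a small hypothetical subset, the "follows an infinitive
ending" check is simplified, and all the names are made up for the
example.  It just shows how adding "te" to a pronoun-removal rule also
catches nouns which happen to end in "-arte":

```python
# Toy illustration only; this is NOT the Snowball Spanish stemmer.
# The pronoun lists and the infinitive check are simplified assumptions.
CURRENT_PRONOUNS = ("me", "se")                   # hypothetical subset
PROPOSED_PRONOUNS = CURRENT_PRONOUNS + ("te",)    # candidate addition
INFINITIVE_ENDINGS = ("ar", "er", "ir")           # unaccented forms only


def strip_attached_pronoun(word, pronouns):
    """Drop a trailing pronoun if the remainder ends like an infinitive."""
    # Try longer pronouns first so the longest match wins.
    for pronoun in sorted(pronouns, key=len, reverse=True):
        if word.endswith(pronoun):
            base = word[: -len(pronoun)]
            if base.endswith(INFINITIVE_ENDINGS):
                return base
    return word


def newly_affected(words, old, new):
    """Return {word: new_result} for words the extra pronouns change."""
    affected = {}
    for w in words:
        before = strip_attached_pronoun(w, old)
        after = strip_attached_pronoun(w, new)
        if before != after:
            affected[w] = after
    return affected


# "baluarte" and "estandarte" are nouns, not verb + pronoun.
SAMPLE = ["saludarme", "saludarte", "baluarte", "estandarte", "gente"]

if __name__ == "__main__":
    changes = newly_affected(SAMPLE, CURRENT_PRONOUNS, PROPOSED_PRONOUNS)
    for word, result in changes.items():
        print(word, "->", result)
```

Running a candidate rule over a large vocabulary in this way and
reviewing the words it newly changes is one way to judge whether the
extra conflations are worth the side effects.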

It could also just be an oversight.

Given that similar forms are handled, it seems these forms should be
handled too, provided that can be done without causing problems for
other words.

Cheers,
    Olly


