[Snowball-discuss] Problem with spanish stemmer

Martin Porter martin.porter at grapeshot.co.uk
Wed Oct 31 09:29:04 GMT 2007


Ignacio,

What you have outlined should work, and I would have to look at your
approach in some detail to see where the problem lies. (Something,
incidentally, that I do not currently have the time to do!)

I have just done a simple test in which the line of suffixes,

'{a'}ramos' 'i{e'}ramos' 'i{e'}semos' '{a'}semos'

is additionally preceded by the line

'aramos' 'ieramos' 'iesemos' 'asemos'

and this works fine, your word "tomaramos" splitting as "tom-aramos".

So I can suggest that as an approach: supplement the algorithm with
extra endings, corresponding to the accented forms but with the accent
removed. I suggest you build it up bit by bit, and test it out as each
new ending, or set of endings, is included.
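
As a rough illustration of that approach outside Snowball, here is a
minimal Python sketch. It is not the Spanish stemmer itself: it ignores
the stemmer's regions and every other rule, and only shows longest-match
suffix removal over a list that holds both the accented endings and
their unaccented copies. The names strip_acute, ENDINGS and split_ending
are hypothetical, and the short ending "amos" is included only because
the "tomar-amos" split reported in this thread implies it.

```python
import unicodedata

# Accented verb endings quoted in this thread (Snowball writes á as {a'}).
ACCENTED = ["áramos", "iéramos", "iésemos", "ásemos"]


def strip_acute(text):
    """Remove acute accents only, so that ñ and ü are left untouched."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch != "\u0301")
    return unicodedata.normalize("NFC", kept)


# Supplement the accented endings with unaccented copies, plus the short
# ending "amos" (an assumption, see the note above). Longest endings are
# tried first, which imitates longest-match suffix selection.
ENDINGS = sorted(
    set(ACCENTED) | {strip_acute(s) for s in ACCENTED} | {"amos"},
    key=len,
    reverse=True,
)


def split_ending(word, endings=ENDINGS):
    """Return (stem, ending) for the first (longest) ending that matches."""
    for ending in endings:
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)], ending
    return word, ""
```

With the supplemented list, split_ending("tomaramos") gives "tom" plus
"aramos". If only "amos" is available, as in
split_ending("tomaramos", ["amos"]), the result is "tomar" plus "amos",
which is the split described in the original question.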

This problem has arisen before. See, for example, Andrew Green's email
of 19 May 2007.

Martin

On Mon, 2007-10-29 at 20:29 -0300, Ignacio Perez wrote:
> I'm working with the Spanish stemmer and I'm having a problem with
> the verb suffixes. The input I'm stemming is not orthographically
> perfect and I cannot rely on the accents for stemming. I thought,
> then, that I could remove all accents from my input and from the
> stemmer (for most verb suffixes this does not represent a problem,
> since "iéramos", "íamos", "ábamos", "áramos", etc. are surely suffixes
> even when they're written as "ieramos", "iamos", "abamos", "aramos";
> there is no ambiguity). Surprisingly (for me) the stemmer did not
> behave as I expected, and words like "tomaramos" were split as
> "tomar-amos".
> Evidently I do not understand the behaviour of the stemmer, and the
> accents matter more to it than I assumed, as in this line of suffixes:
> '{a'}ramos' 'i{e'}ramos' 'i{e'}semos' '{a'}semos'
> 
> So, how can I use the stemmer making it not accent-sensitive?
> 
> Thanks a lot
> 
> Ignacio



