[Snowball-discuss] Spanish stemmer with accents stripped before stemming

Andrew Green ndrw_grn at yahoo.com.mx
Wed May 23 00:09:24 BST 2007


Hi, Martin,

Thank you very much for your replies...

> Obviously to us, it is a bit easier to look at the problem
> from the snowball angle, rather than think about the generated java
> after it's been put inside lucene! As far as the snowball script is
> concerned, I believe you could strip out accents from the source,
> eliminate the duplicate strings in the amongs(..) that would result, and
> recompile, getting the effect you want.

OK... We just tried that, and it works very well so far! Thanks a bunch.
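
For anyone following along, the string-level part of that change can be sketched roughly as below. This is only an illustration in Python, not the edit we made to the Snowball source itself: the helper names and the sample suffix list are made up for the example, and keeping "ñ" untouched is an assumption on my part rather than something settled in this thread.

```python
# Illustrative sketch only: helper names and the sample list are hypothetical,
# and the list is NOT the real contents of the Spanish stemmer's among(...) tables.

# Assumption: fold acute accents and the dieresis, but leave "ñ" alone.
ACCENT_MAP = str.maketrans("áéíóúüÁÉÍÓÚÜ", "aeiouuAEIOUU")


def strip_accents(text: str) -> str:
    """Return text with accented vowels replaced by their plain forms."""
    return text.translate(ACCENT_MAP)


def dedupe_stripped(strings):
    """Strip accents from each string and drop the duplicates that result,
    keeping the first occurrence so the original order is preserved."""
    seen = set()
    result = []
    for s in strings:
        plain = strip_accents(s)
        if plain not in seen:
            seen.add(plain)
            result.append(plain)
    return result


# A few sample strings, some of which collide once accents are removed.
sample_suffixes = ["ación", "acion", "ías", "ias", "mente"]
print(dedupe_stripped(sample_suffixes))  # ['acion', 'ias', 'mente']
```

The same strip_accents step would also have to be applied to the words being indexed and searched, so that they match the accent-free strings in the recompiled stemmer.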

> (Incidentally, I have hit this problem with Spanish stemming before, but
> it was a long while ago -- before the development of snowball.)

Accents are the boogeyman of day-to-day written Spanish usage and it's
hard to imagine an effective search engine that obliges users to type
them correctly.

> Thinking about it further, this will not work, since the strings are
> placed in tables which would need to be fully reorganised if any of the
> characters in the strings were readjusted. (It is the way a snowball
> 'among' is implemented.)
> 
> The only way to do this is to modify the stem.sbl file for Spanish,
> regenerate the java code with the snowball compiler (which you can
> download) and replace the old java with the new in your application.

Ah... So now we know why our previous attempt failed.

It occurs to me that perhaps it would be a good idea to modify
Snowball's Spanish stemmer to accept both accented and accent-stripped
input.

Greetings,
Andrew Green

P.S. Our little server was down during the weekend (sorry). It's back
online again, though a commit error to the repository has now made the
relevant files difficult to access. They seem less relevant now,
I suppose.



