[Snowball-discuss] Spanish stemmer with accents stripped before stemming

Martin Porter martin.porter at grapeshot.co.uk
Thu May 24 10:29:56 BST 2007


Andrew,

> It occurs to me that perhaps it would be a good idea to modify
> Snowball's Spanish stemmer to accept both accented and accent-stripped
> input.

I think that is a good point. As you say, "accents are the bogeyman of
day-to-day written Spanish usage and it's hard to imagine an effective
search engine that obliges users to type them correctly."
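
As a rough illustration of what "accent-stripped input" could mean in
practice, here is a minimal Python sketch of a pre-processing step run
before stemming. It is not part of Snowball: the function name
strip_spanish_accents is hypothetical, and it assumes that only the acute
accent on vowels is dropped while n-tilde and u-diaeresis are left alone.

```python
import unicodedata

# Assumption: only acute accents on vowels are stripped; ñ and ü are kept,
# since they are treated here as distinct letters rather than accents.
_ACUTE_TO_PLAIN = str.maketrans("áéíóúÁÉÍÓÚ", "aeiouAEIOU")


def strip_spanish_accents(text: str) -> str:
    """Return text with acute accents removed from vowels (hypothetical helper)."""
    # Compose first so that 'o' + combining acute becomes the single character 'ó'.
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(_ACUTE_TO_PLAIN)
```

Applying the same step to both the indexed text and the query would make
"canción" and "cancion" reach the stemmer as the same string.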

The occasion when I came across this problem before was news data in
Spanish where the placing of accents was very untrustworthy. There is a
variant of the Snowball German stemmer in which an umlaut is represented
by a following e (ae, oe, ue), but there are no such variants for the
Romance language stemmers.

I'm not sure what the deal is for Portuguese, but Spanish is as you
describe it. In French, accents are applied quite rigorously, except
that they can be omitted when the text is entirely in
upper case. (But is that stylistic feature less prevalent than it was a
century ago? I'm not sure ...) Anyway, keeping accents in place with
French does not seem to be problematic.

Italian presents an interesting case. They use acute and grave, but not
by any consistent rule. There are different schemes for applying
acute/grave, which vary (or used to vary) among publishing houses. This
is why the Italian stemmer begins with the strange operation of
replacing all acutes with graves. A critical ending is then -o+accent,
but even if the accent is absent, -o is a similar ending, and will be
removed by the same rule (compare porto`, he carried, with porto, I
carry). The result is that the Italian stemmer does not behave very
differently on texts with all accents stripped.
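
The acute-to-grave replacement can be sketched in Python as below. This is
only an illustration of that one opening step, not the Snowball source;
the name acute_to_grave is hypothetical, and the sketch assumes lower-case
input and covers only the five vowels.

```python
import unicodedata

# Assumption: lower-case input; each acute vowel maps to its grave counterpart.
_ACUTE_TO_GRAVE = str.maketrans("áéíóú", "àèìòù")


def acute_to_grave(word: str) -> str:
    """Replace acute-accented vowels with grave-accented ones (hypothetical helper)."""
    # Compose first so that a vowel followed by a combining accent is one character.
    composed = unicodedata.normalize("NFC", word)
    return composed.translate(_ACUTE_TO_GRAVE)
```

After this step a word written with either accent reaches the ending
rules in the same form, and a word with no accent at all passes through
unchanged.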

We'll keep your suggestion in mind as a Snowball development.

Martin






