[Snowball-discuss] Doubt about the portuguese stem

Martin Porter martin at porterloo.wanadoo.co.uk
Fri Aug 21 11:00:00 BST 2009


Leonardo,

Hello. Yes, this point often comes up (well, not that often ... perhaps once
every two years). You can search the gmane archive of Snowball-discuss,
looking for "andrew green". In summary, here is a note I wrote on 19 May 2007
on the subject of accents in Spanish, and the problem of their omission:



The occasion when I came across this problem before was news data in
Spanish where the placing of accents was very untrustworthy. There is a
variant of the Snowball German stemmer in which umlaut is represented by
following e, but there are no variants for the Romance language
stemmers.

I'm not sure what the deal is for Portuguese, but Spanish is as you
describe it. In French, accents are applied quite rigorously,
except that they can be omitted when the text is entirely in
upper case. (But is that stylistic feature less prevalent than it was a
century ago? I'm not sure ...) Anyway, keeping accents in place with
French does not seem to be problematic.

Italian presents an interesting case. They use acute and grave, but not
by any consistent rule. There are different schemes for how acute/grave
is applied, which vary (or used to vary) among publishing houses. This
is why the Italian stemmer begins with the strange operation of
replacing all acutes with graves. A critical ending is then -o+accent,
but even if the accent is absent, -o is a similar ending, and will be
removed by the same rule (compare porto`, he carried, with porto, I
carry). The result is that the Italian stemmer does not behave very
differently on texts with all accents stripped.

--- end of quote
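
The Italian point in the note above can be illustrated with a small Python
sketch. It is not the Snowball Italian stemmer: the function names are
hypothetical, and strip_final_o is a toy stand-in for the real ending rule.
It only shows why turning acutes into graves first, and then treating -o
with or without the accent alike, makes accented and stripped text come out
the same.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent
GRAVE = "\u0300"  # combining grave accent


def acute_to_grave(word: str) -> str:
    """Replace every acute accent in word with a grave accent."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(ACUTE, GRAVE))


def strip_final_o(word: str) -> str:
    """Toy rule (an assumption, not the real stemmer): drop a final o-grave or o."""
    for ending in ("\u00f2", "o"):  # o with grave accent, then plain o
        if word.endswith(ending):
            return word[: -len(ending)]
    return word


# The same word written with an acute, with a grave, and with no accent.
for form in ("port\u00f3", "port\u00f2", "porto"):
    print(form, "->", strip_final_o(acute_to_grave(form)))
```

All three spellings end up as "port" in this sketch, which is the sense in
which stripped text does not behave very differently.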


The answer for Spanish is to extend the ending list to include unaccented
forms; again, see the correspondence with "ignacio perez",



I have just done a simple test in which the line of suffixes,

'{a'}ramos' 'i{e'}ramos' 'i{e'}semos' '{a'}semos'

is additionally preceded by the line

'aramos' 'ieramos' 'iesemos' 'asemos'

and this works fine, your word "tomaramos" splitting as "tom-aramos".

So I can suggest that as an approach: supplement the algorithm with
extra endings, corresponding to the accented forms but with the accent
removed. I suggest you build it up bit by bit, and test it out as each
new ending, or set of endings, is included.

---- end of quote
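
As a rough illustration of that approach, here is a Python sketch that
derives the unaccented endings from the accented ones and then does a plain
longest-suffix split. The helper names are hypothetical, and the split
ignores the regions and conditions the real Spanish stemmer applies, so it
is only a way to check a supplemented ending list, not a replacement for
editing the Snowball script.

```python
import re
import unicodedata

# The four endings from the suffix line quoted above, in Snowball notation.
ACCENTED = ["{a'}ramos", "i{e'}ramos", "i{e'}semos", "{a'}semos"]

# Matches a Snowball-style escape such as {a'} (letter plus acute accent).
ESCAPE = re.compile(r"\{([a-z])'\}")


def accented_form(suffix: str) -> str:
    """Turn {a'} into the real accented letter."""
    return unicodedata.normalize("NFC", ESCAPE.sub("\\1\u0301", suffix))


def unaccented_form(suffix: str) -> str:
    """Turn {a'} into the bare letter, dropping the accent."""
    return ESCAPE.sub(r"\1", suffix)


def split_suffix(word: str, suffixes: list[str]) -> tuple[str, str]:
    """Split off the longest matching suffix (toy rule, no regions)."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return word, ""


UNACCENTED = [unaccented_form(s) for s in ACCENTED]
ALL_SUFFIXES = [accented_form(s) for s in ACCENTED] + UNACCENTED

print(UNACCENTED)
print(split_suffix("tomaramos", ALL_SUFFIXES))
```

With only the accented endings, "tomaramos" is left unsplit; with the
supplemented list it splits as "tom" plus "aramos", matching the test
described above.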


I think Portuguese is more difficult, because the 'tilde' indicates a
consonant value, and this is reflected in the stemming algorithm. It seems
to me okay to stem sa~o to sa~ (forgive the limitations of my keyboard), but
not to stem sa~o to sa, or to stem sao to sa. Of course, English speakers
typically write 'Sao Paulo', but that is unfortunate, and the result of
them, like me, not having the accented 'a' on the keyboard. I do not know
the answer to this: any advice would be helpful,
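
To make the distinction concrete, here is a small Python sketch of accent
folding that can either keep or drop the tilde. It is an illustration only,
with a hypothetical function name, and it is not an answer to the question:
it does nothing for input that already arrives as "sao".

```python
import unicodedata

TILDE = "\u0303"  # combining tilde


def fold_accents(word: str, keep: tuple[str, ...] = (TILDE,)) -> str:
    """Strip combining marks from word, except those listed in keep."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) or ch in keep
    )
    return unicodedata.normalize("NFC", kept)


print(fold_accents("caf\u00e9"))           # acute is dropped
print(fold_accents("s\u00e3o"))            # tilde is kept by default
print(fold_accents("s\u00e3o", keep=()))   # everything dropped: "sao"
```

Note that this also strips the cedilla, since it is a combining mark too;
whether that is acceptable for Portuguese is a separate question.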


Martin


At 10:31 PM 8/19/2009 +0200, Leonardo Borges wrote:
>Hello guys,
>
>I am currently evaluating Sphinx as an option for my projects and, since I
>am Brazilian, wanted to try the Portuguese stemmer you guys
>provide.
>
>Thus, I compiled Sphinx with the libstemmer option and everything went
>great.
>
>Given the following phrase, in one of my documents: "Então, vamos começar a
>usar libstemmer"
>
>The following searches return the correct document:
>"Então", "Entã", "então", "entã"
>which is great, but if I search for:
>"Entao"
>It returns nothing.
>





