[Snowball-discuss] French stemmer, Snowball project

Mon Sep 9 12:24:01 2002

[This message to Fred Brault ought to be posted on snowball-discuss, as it
gives the background to some recent changes to the Snowball site. It was
originally sent 1 Sept 02 - Martin]

Fred,

I've looked at the algorithm alongside your comments, and can now report
back.

Things aren't as bad as I at first thought ...

A) The rule for marking u and i/y as consonants is not explained at all
well, and will have to be improved. The point is that it works from left to
right along the word. So in risquiez (to quote one of your examples), i is a
vowel because it is between consonants, u is a consonant because it is after
q, and the following i is a vowel because it is after U (u as a consonant)
and before e. So the word becomes risqUiez, and not, as you have it,
risquIez.

If you think about it, the i should not be treated as a consonant in this
case.
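
As a rough illustration of the left-to-right marking (a Python sketch, not
the Snowball source), the function below upper-cases u after q, u or i
between two vowels, and y next to a vowel. The vowel set, the y rule and the
name mark_consonants are assumptions made for the example; the point it shows
is that a letter already marked counts as a consonant for the letter that
follows it.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")  # assumed French vowel set (lower case only)

def mark_consonants(word):
    """Mark u, i, y as consonants (upper case), scanning left to right."""
    chars = list(word)
    for k, c in enumerate(chars):
        # prev is read from the partly rewritten word, so an earlier
        # 'U', 'I' or 'Y' is no longer a vowel at this point.
        prev = chars[k - 1] if k > 0 else ""
        nxt = chars[k + 1] if k + 1 < len(chars) else ""
        if c == "u" and prev == "q":
            chars[k] = "U"
        elif c in ("u", "i") and prev in VOWELS and nxt in VOWELS:
            chars[k] = c.upper()
        elif c == "y" and (prev in VOWELS or nxt in VOWELS):
            chars[k] = "Y"
    return "".join(chars)

print(mark_consonants("risquiez"))  # risqUiez
print(mark_consonants("évanouie"))  # évanoUie
```

In both words the i stays lower case because the letter before it has
already been turned into U.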

This accounts for most of the differences you noted.

B) There is a clear slip in the description of the algorithm, which you have
spotted. For 'eus' before 'ement', it should read: "delete if in R2, or
replace by 'eux' if in R1". I will put that right. ('ement' is tricky
because it is a single form for two endings - see the table of endings for
the Romance languages.)

This explains the 'pieusement' example. I realise the algorithm is failing
to equate pieuse with pieux, but that is a short-stem problem (see below).
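
The corrected 'eus' rule can be sketched in Python as follows. This is only
an illustration under stated assumptions: R1 is taken to be the region after
the first non-vowel following a vowel, R2 is the same construction applied
inside R1, the vowel set is assumed, and the RV test on 'ement' itself and
all the other branches are left out. The names region_start and remove_ement
are hypothetical.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")  # assumed French vowel set

def region_start(word, start=0):
    """Index just past the first non-vowel that follows a vowel, scanning
    from `start`; len(word) if there is no such position."""
    for k in range(start + 1, len(word)):
        if word[k] not in VOWELS and word[k - 1] in VOWELS:
            return k + 1
    return len(word)

def remove_ement(word):
    """Drop 'ement', then apply only the corrected 'eus' rule."""
    if not word.endswith("ement"):
        return word
    r1 = region_start(word)        # start of R1
    r2 = region_start(word, r1)    # start of R2
    stem = word[:-len("ement")]
    if stem.endswith("eus"):
        pos = len(stem) - 3        # where 'eus' begins
        if pos >= r2:
            return stem[:-3]           # delete if in R2
        if pos >= r1:
            return stem[:-3] + "eux"   # or replace by 'eux' if in R1
    return stem

print(remove_ement("heureusement"))  # heureux
print(remove_ement("pieusement"))    # pieus
```

In pieusement the 'eus' starts before R1 begins, so neither branch applies
and the result stays as pieus, which is the short-stem effect described
above.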

But the treatment of -ier etc in step 4 is correct. Remember that the test
operates in RV (this is declared at the front of the step). In crier, RV is
just [er], not [ier]. I realise that -er is an ending in crier, but allowing
for very short stems like 'cri' leads to too many errors generally.
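
A small Python sketch of the RV test may make this concrete. It assumes the
French definition of RV as I understand it from the algorithm description
(if the word begins with two vowels, RV is the region after the third
letter; otherwise it is the region after the first vowel not at the
beginning of the word), and it assumes the vowel set. The names rv_start and
in_rv are hypothetical.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")  # assumed French vowel set

def rv_start(word):
    """Index at which RV begins, under the assumed definition."""
    if len(word) >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        return min(3, len(word))
    for k in range(1, len(word)):      # first vowel not at the beginning
        if word[k] in VOWELS:
            return k + 1
    return len(word)

def in_rv(word, suffix):
    """True if `word` ends with `suffix` and the suffix lies wholly in RV."""
    return word.endswith(suffix) and len(word) - len(suffix) >= rv_start(word)

print(repr("crier"[rv_start("crier"):]))  # 'er'
print(in_rv("crier", "ier"))              # False
print(in_rv("familier", "ier"))           # True
```

So the step 4 test for -ier fails in crier but succeeds in familier.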

This is a problem with all the Romance language stemmers, and the definition
of RV, rather complicated for some of them, is trying to get the balance
just right. It arises because all these languages have certain verbs with
very short stems: crier, prier, rier etc. I have occasionally tried to
establish lists of such verbs, but it is not easy.

C) Now for your final suggestion, to respell i{e`}r as i{e`}re etc after
ement removal. This doesn't quite work since iere is only removed later if
nothing was done in the step that removed ement, but you can get the effect
by just replacing i{e`}r in RV with i, following ement removal.

This leads to the following pattern of changes:

familier                      famili
familièrement                 familier      -> famili

financier                     financi
financièrement                financier     -> financi

foncièrement                  foncier       -> fonci

grossière                     grossi
grossièrement                 grossier      -> grossi
grossières                    grossi

irrégulière                   irréguli
irrégulièrement               irrégulier    -> irréguli
irréguliers                   irréguli

particulière                  particuli
particulièrement              particulier   -> particuli
particuliers                  particuli

premier                       premi
première                      premi
premièrement                  premier       -> premi
premières                     premi
premiers                      premi

régulier                      réguli
régulière                     réguli
régulièrement                 régulier      -> réguli

singulier                     singuli
singulière                    singuli
singulièrement                singulier     -> singuli
singulières                   singuli
singuliers                    singuli

- a definite improvement so I will put it in. The change in the Snowball
script is

                try (
                    [substring] among(
                        'iv' (R2 delete ['at'] R2 delete)
                        'eus' ((R2 delete) or (R1<-'eux'))
                        'abl' 'iqU' (R2 delete)
                        'i{e`}r'          <---new
                        'I{e`}r'          <---new
                            (RV <-'i')    <---new
                    )
                )
            )

The case I{e`}r should be included here, and yet there are no words in the
sample vocabulary that illustrate it. The question is, can you think of a
word ending Vi{e`}rement in French, where V is a vowel!?
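
For anyone implementing outside Snowball, the effect of the new branch can
be sketched in Python. This is an illustration only: it covers just the
'ement' deletion and the new replacement, it ignores the other branches
('iv', 'eus', 'abl', 'iqU'), and it assumes the vowel set and the French
definition of RV (after the third letter if the word begins with two vowels,
otherwise after the first vowel not at the beginning). The names rv_start
and strip_ement are hypothetical.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")  # assumed French vowel set

def rv_start(word):
    """Index at which RV begins, under the assumed definition."""
    if len(word) >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        return min(3, len(word))
    for k in range(1, len(word)):
        if word[k] in VOWELS:
            return k + 1
    return len(word)

def strip_ement(word):
    """Delete 'ement' in RV, then replace a preceding ièr/Ièr in RV by i."""
    rv = rv_start(word)
    if not word.endswith("ement") or len(word) - 5 < rv:
        return word
    stem = word[:-5]
    for ending in ("ièr", "Ièr"):                 # the two new cases
        if stem.endswith(ending) and len(stem) - 3 >= rv:
            return stem[:-3] + "i"                # RV <- 'i'
    return stem

for w in ("familièrement", "premièrement", "grossièrement"):
    print(w, "->", strip_ement(w))
# familièrement -> famili
# premièrement -> premi
# grossièrement -> grossi
```

These agree with the right-hand column of the table above.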

-------------

I will add (A), (B), (C) in soon. Right now the website is being reorganised
(possibly going to a new server), but when Richard Boulton has finished that
I will put the changes in place.

Martin

At 12:19 PM 8/31/02 -0400, FREDERICK BRAULT wrote:
>
>Dear Mr. Porter, dear Mr. Boulton,
>
>I implemented the French stemmer that you suggest through the Snowball
>project and it functions well indeed. However, I identified a few small
>inconsistencies that I wanted to share with you to contribute to the
>improvement of the Snowball project.
>
>I suspect there are a few errors in the script that generated the list of
>French words and stems that the Snowball project provides for checking
>any other implementation of the stemmer algorithm (or maybe the errors
>are just in the list itself). The errors occur with the suffixes "ier,
>ière, Ier and Ière" of step 4. According to the algorithm, these suffixes
>should be replaced by "i", but they aren't in the list (which is
>reproduced in the attached file. Open it in Microsoft Paint if you can't
>see it well). In that list, I suspect that these suffixes didn't work,
>that the next operation in step 4 (with the suffix 'e') was carried out
>instead, and that step 6 then removed the remaining accent. It is
>important not to miss step 4 because then, for example, the masculine
>"entier" is not associated with its feminine counterpart "entière", as
>can be seen in the attached list.
>
>Another error in the list is the word "pieusement" (also reproduced in
>the attached file), which should give "pieux" by virtue of step 1, in
>the "else" part of the rule for the "ement and ements" suffixes. In that
>list, the "else" part wasn't executed, which gave "pieus".
>
>I would also suggest adding the suffix "Ie" in step 2a. Because of the
>definition of the vowels, words such as "évanouie" and "réjouie" (not in
>the attached file) give "évanoui" and "réjoui", and so are not grouped
>with the other words in the same class (évanoui, évanouie, évanouir,
>évanouirent, évanouis, évanouissait, évanouissement, évanouit), which
>give "évanou" and "réjou". The problem with the current suffixes is that
>the "u" and "i" get upper-cased because they are between vowels, giving
>"évanoUIe" and "réjoUIe". Adding the suffix "Ie" in step 2a solves the
>problem.
>
>Another suggestion is to add the suffix "Iez" in step 2b along with "é,
>ée, ées, és, ... , ez, iez". Some words like "risquiez, renvoyiez and
>payiez" give "risqUIez, renvoyIez and payIez" because the "i" is between
>vowels. Maybe I got the rules for manipulating the vowels wrong, or the
>suffix "Iez" has been forgotten in the description of the algorithm,
>because in the "checking" list provided by the Snowball project, the
>words "risqUIez, renvoyIez and payIez" are correctly stemmed.
>
>Finally, another suggestion, although I am not sure if everybody would
>agree with it. Let's see! I would suggest adding a rule to step 1, about
>the "ement and ements" suffixes. Here it goes: "if preceded by "ièr",
>replace by "ière" (with no consideration of R1, R2 or RV)". The
>remaining "ière" suffix would then be removed later by step 4. This would
>allow adverbs derived from the feminine adjectives to be grouped with
>other words that have a close meaning. For example, "premièrement"
>("firstly") would go "premièrement" --> "première" --> "premi". This
>would allow the words "premier" ("first", masculine), "première"
>("first", feminine) and "premièrement" ("firstly") to be grouped
>together. However, as I said, I am not sure if everybody wants the
>adverbs to be grouped with the adjectives and nouns. The current
>algorithm separates the adjectives and nouns from the adverbs.
>
>Well, this is it! I hope I didn't make any mistakes myself. The
>corrections I suggest seem to solve the problems and give the right
>answers according to the checking list. However, I don't know if the
>corrections would cause trouble with other words that aren't in the
>list. That would have to be verified.
>
>Fred Brault
>
>Attachment Converted: C:\EUDORA\ATTACH\3.gif
>