[Snowball-discuss] Stemming French words which have a plural in "x"
Yann Barsamian
Yann.Barsamian at ulb.ac.be
Tue Apr 30 13:25:28 BST 2019
Hi everybody,
I'm a French researcher in computer science, and I use Snowball in
my research (text indexing of documents). First of all, thank you for
providing such a tool!
I see that Snowball's French stemming algorithm handles plurals in "eaux"
("châteaux" is the plural of "château") and in "aux" ("chevaux" is
the plural of "cheval"), as well as adjectives in "eux", but not
other plurals in "x". French has many irregular plurals, so I was
wondering whether the French stemmer in Snowball could be updated to
handle them?
For example, I noticed that "voeux" is not stemmed to "voeu" (likewise,
"jeux" is not stemmed to "jeu", etc.). Can something be done to improve
the stemming algorithm? It could also cover the following irregular
plurals (the list is not exhaustive, just a baseline for further
discussion):
* 7 nouns (hibou, caillou, chou, bijou, genou, joujou, pou) form their
plural in "x" (hiboux, cailloux, choux, bijoux, genoux, joujoux, poux)
* "yeux" is the plural of "oeil"
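To make the idea concrete, here is a minimal sketch of an exception
table applied before stemming. It is written in Python rather than in
the Snowball language, and it is not how the French stemmer is
implemented: the names IRREGULAR_PLURALS, normalize_plural and
stem_with_exceptions are hypothetical, and the table only contains the
plurals mentioned above.

```python
# Hypothetical exception table: irregular plural -> singular.
# The entries are the examples from this message; this is an
# illustration only, not part of Snowball.
IRREGULAR_PLURALS = {
    "hiboux": "hibou",
    "cailloux": "caillou",
    "choux": "chou",
    "bijoux": "bijou",
    "genoux": "genou",
    "joujoux": "joujou",
    "poux": "pou",
    "yeux": "oeil",
    "voeux": "voeu",
    "jeux": "jeu",
}


def normalize_plural(word):
    """Return the singular of a known irregular plural, else the word unchanged."""
    return IRREGULAR_PLURALS.get(word.lower(), word)


def stem_with_exceptions(word, stem):
    """Apply the exception table, then a caller-supplied stemming function."""
    return stem(normalize_plural(word))
```

Any stemming function can be passed as the second argument, so the
table stays independent of the stemmer itself. Words that are not in
the table are passed through untouched.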
I also noticed some missed stemmings:
* "illustrent" and "illustre" are not stemmed to "illustr", as they
could be
I wanted to check the mailing list archives to see whether this has
already been discussed, but the archives are not very convenient to
search, so I apologize if this has come up before.
If this seems interesting to you, and if you have unit tests for the
French stemming algorithm (if I update the algorithm, there should at
least be some tests to make sure everything else still works as
intended!), I would be willing to contribute to the algorithm to
improve it, even though for now I do not really know how to do it.
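On the testing question, one possible way to guard such a change is to
run the old and the new stemmer over a fixed word list and report every
word whose stem changed unintentionally. The sketch below assumes both
stemmers are available as plain Python functions; find_regressions and
the two toy stemmers are hypothetical and stand in for the real
algorithm.

```python
def find_regressions(old_stem, new_stem, vocabulary, expected_changes):
    """Return (word, old, new) for every unplanned change in stemming.

    expected_changes maps a word to the stem the new version is meant
    to produce; any other difference is reported as a regression.
    """
    regressions = []
    for word in vocabulary:
        before, after = old_stem(word), new_stem(word)
        if before != after and expected_changes.get(word) != after:
            regressions.append((word, before, after))
    return regressions


def toy_old_stem(word):
    """Toy stand-in for the current stemmer: strips a final 's' only."""
    return word[:-1] if word.endswith("s") else word


def toy_new_stem(word):
    """Toy stand-in for a modified stemmer: also strips a final 'x'."""
    return word[:-1] if word.endswith(("s", "x")) else word


if __name__ == "__main__":
    vocabulary = ["maisons", "hiboux", "prix"]
    # "hiboux" -> "hibou" is intended; "prix" -> "pri" is not.
    print(find_regressions(toy_old_stem, toy_new_stem,
                           vocabulary, {"hiboux": "hibou"}))
```

With the toy stemmers, only "prix" is reported, which shows how a naive
rule for plurals in "x" can damage words that merely end in "x".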
Have a nice day,
Yann