[Snowball-discuss] handling english plural words in dutch stemmer
Martin Porter
martin.f.porter at gmail.com
Thu Jan 12 10:12:16 GMT 2012
Srini,
Yes, I can see that you have hit a really awkward case here, where you
have a language (Dutch) with a large import of foreign words (mainly
English), and where the foreign words are often names of products:
cameras, TVs and so on, and therefore critically important in your
application.
Just to clarify a bit, the Dutch stemmer (generally speaking) removes
final -s unless preceded by a vowel (j is classed as a vowel). So
"printers" goes to "printer" and "ipods" to "ipod" (despite your
example). But the large class of words ending -<vowel>s is unchanged:
cameras, phones, fridges, biros etc.
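To make the -s rule concrete, here is a rough Python sketch of just that one condition. It is an illustration only, not the Snowball implementation: the function name and the vowel set are assumptions, and the real stemmer applies further conditions that are not modelled here.

```python
# Simplified illustration of the single rule described above:
# drop a final -s unless the letter before it is a vowel (j counts as a vowel).
# Assumption: this vowel set is illustrative; the real stemmer has more conditions.
VOWELS = set("aeiouyè")


def strip_final_s(word: str) -> str:
    """Remove a trailing 's' when the preceding letter is a valid s-ending."""
    if len(word) >= 2 and word.endswith("s"):
        before = word[-2]
        if before not in VOWELS and before != "j":
            return word[:-1]
    return word


for w in ("printers", "ipods", "cameras", "fridges"):
    print(w, "->", strip_final_s(w))
```

Under this sketch "printers" and "ipods" lose the -s, while "cameras" and "fridges" are returned unchanged because a vowel precedes the -s.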
I rather doubt if removing the -s more generally affects Dutch too
much. One approach is therefore to relax the rule so that any letter
before the -s is a valid s-ending.
A simpler approach, and one that will probably give a better result,
is to apply Dutch stemming, and if the word is unchanged after
stemming and ends -s, apply s-stemming, which is simple enough to
write as in-line code (roughly: -ies -> y, or -es -> e, or ss -> s,
or s -> null). See
http://www.gossamer-threads.com/lists/lucene/java-dev/102395?do=post_view_threaded
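The fallback described above could be sketched in Python roughly as follows. This is a minimal sketch, not a tested production routine: the names s_stem and stem_with_s_fallback are hypothetical, the Dutch stemmer is passed in as any callable, and the suffix rules are taken literally from the rough list in this message, tried in order with the first match winning.

```python
from typing import Callable

# Suffix rules as listed in the message, tried in order; first match wins.
S_RULES = (("ies", "y"), ("es", "e"), ("ss", "s"), ("s", ""))


def s_stem(word: str) -> str:
    """Apply the rough s-stemming rules to a lowercase word."""
    for suffix, replacement in S_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def stem_with_s_fallback(word: str, dutch_stem: Callable[[str], str]) -> str:
    """Run the Dutch stemmer first; s-stem only if it left an -s word unchanged."""
    stemmed = dutch_stem(word)
    if stemmed == word and word.endswith("s"):
        return s_stem(word)
    return stemmed


if __name__ == "__main__":
    # Stand-in stemmer for demonstration only: it changes nothing.
    identity = lambda w: w
    for w in ("tvs", "cameras", "fridges"):
        print(w, "->", stem_with_s_fallback(w, identity))
```

With the snowballstemmer package for Python, dutch_stem could plausibly be snowballstemmer.stemmer("dutch").stemWord; that wiring is an assumption here and is left out so the sketch stays self-contained.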
Your case tvs -> tvs is a problem in English too. The stemmers don't
think tvs, DVDs etc. are really words, so they leave the -s in place.
But simple -s stemming could remove it. I'd be inclined to give this
approach a try and see if it's good enough.
Martin
On Wed, Jan 11, 2012 at 1:54 AM, Srinivasan Ramaswamy
<ursvasan at gmail.com> wrote:
> Hi All,
>
> I use the Dutch Snowball stemmer. It does well for Dutch words, but sometimes I
> have to handle some English words too. For example tvs, cameras, ipods, etc. I
> noticed that these words don't get stemmed.
>
> tvs =>(after stemming) tvs
> cameras =>(after stemming) cameras
> . . . . . . . .