[Snowball-discuss] Why is German2 variant stemmer abandoned?

Piotr Marciniak mandaryn at ragnarson.com
Wed Jan 25 13:00:39 GMT 2012


Hi,

The German2 variant stemmer seems to be abandoned (it is not compiled, which makes it harder to use); however, I find it much more useful than the main German stemmer.

Correct me if I am wrong, but the cases of using ae, oe and ue in German vocabulary are excessive. In fact, it seems widely accepted to use the two-letter form wherever non-ASCII characters are forbidden.

Take city names, for example: "Köln" and "Koeln", "München" and "Muenchen". Even the city websites "koeln.de" and "muenchen.de" use the second form.

I would like to get the same stem for both "Köln" and "Koeln", without additional work :).

I am not a German speaker, just working on a POI site that uses lots of different German names (cities and restaurants, see http://www.restaurant-kritik.de if you wish), but uniform matching of "ü" and "ue" and the other umlauts was a primary requirement. We use Sphinx with the normal German stemmer, but I manually fix the words before search and indexing, replacing all ue, oe, ae with u, o, a, which works but makes things more complex.
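
To illustrate the kind of manual fix I mean, here is a minimal sketch in Python. It is not our production code: the function name `fold_umlauts` is made up for this example, and folding "ä", "ö", "ü" to plain vowels inside the same function is my assumption, added only so the example can be checked without running the stemmer.

```python
import re

# Hypothetical helper for illustration; not part of Sphinx or Snowball.
# Two-letter forms and umlauts are both folded to the plain vowel, so
# "Köln" and "Koeln" end up as the same string before stemming.
_FOLD = {"ae": "a", "oe": "o", "ue": "u", "ä": "a", "ö": "o", "ü": "u"}
_FOLD_PATTERN = re.compile("ae|oe|ue|[äöü]")


def fold_umlauts(text: str) -> str:
    """Lowercase the text and fold ae/oe/ue and ä/ö/ü to a/o/u."""
    lowered = text.lower()
    return _FOLD_PATTERN.sub(lambda m: _FOLD[m.group(0)], lowered)


if __name__ == "__main__":
    for name in ("Köln", "Koeln", "München", "Muenchen"):
        print(name, "->", fold_umlauts(name))
```

The same function has to be applied both to the indexed text and to the search query. Note that it also folds words where "ue" is not an umlaut at all (for example "aktuell"), which is harmless only as long as both sides are folded the same way.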

The questions I have are:

1) Am I making sense with my assumptions about umlauts above? They are somewhat contrary to what is written at http://snowball.tartarus.org/algorithms/german2/stemmer.html but I was told that every word can really be written without umlauts this way.

2) Could the German variant stemmer get back into the main distribution of libstemmer? I tried some compiling, but my C skills are too poor to create a production-ready package with the correct stemmer.
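
For reference, this is how I understand the extra mapping that the German2 variant applies before stemming: "ß" becomes "ss", and "ae", "oe", "ue" become "ä", "ö", "ü", except that "ue" is left alone after "q". The Python sketch below only illustrates that mapping. It is not the Snowball code, the name `german2_digraph_map` is made up, and it leaves out the other preprocessing steps and the stemmer itself.

```python
import re

# Illustrative sketch only; assumes the mapping described above.
# "qu" is listed first so that the "ue" in words like "Quelle" is skipped.
_GERMAN2_PATTERN = re.compile("qu|ß|ae|oe|ue")
_GERMAN2_MAP = {"qu": "qu", "ß": "ss", "ae": "ä", "oe": "ö", "ue": "ü"}


def german2_digraph_map(word: str) -> str:
    """Lowercase the word and map two-letter forms to umlauts."""
    return _GERMAN2_PATTERN.sub(lambda m: _GERMAN2_MAP[m.group(0)], word.lower())


if __name__ == "__main__":
    for word in ("Koeln", "Muenchen", "Quelle", "Straße"):
        print(word, "->", german2_digraph_map(word))
```

With this mapping in front of the normal stemmer, "Koeln" and "Köln" would reach the stemmer as the same word, which is the behaviour I am after.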

Piotr Marciniak