[Snowball-discuss] Why is German2 variant stemmer abandoned?

Martin Porter martin.f.porter at gmail.com
Thu Jan 26 11:07:57 GMT 2012


Piotr,

Hi -- the German2 stemmer is not abandoned in any sense, and I can
believe that, for internet use, it is nowadays more useful than the
"pure" German stemmer. Your comments on umlauts make sense (your point
(1)), and I see them as the same as the assumptions about German2 on
the snowball site. You should take note of the exact rules though, so
"feuer" is not equated with feu"r for example (using " to indicate
umlaut).
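
As an illustration of why the exact rules matter, here is a minimal
sketch of the kind of digraph folding being discussed. It is not the
Snowball source. It assumes the German2 prelude works roughly as
follows: a u between two vowels is protected first, then ae, oe and ue
are mapped to the umlaut forms, with ue left alone after q. The function
name fold_umlaut_digraphs and the VOWELS set are hypothetical, chosen
only for this example.

```python
# Sketch only: approximates the digraph folding described above.
# Assumption: "u" between vowels is protected, and "ue" after "q" is kept.
VOWELS = set("aeiouyäöü")


def fold_umlaut_digraphs(word: str) -> str:
    chars = list(word.lower())

    # Step 1: mark a "u" that sits between two vowels (feuer, neue)
    # so that the following "ue" is not treated as an umlaut digraph.
    for i in range(1, len(chars) - 1):
        if chars[i] == "u" and chars[i - 1] in VOWELS and chars[i + 1] in VOWELS:
            chars[i] = "U"

    # Step 2: fold ae/oe/ue into umlauts, skipping "ue" directly after "q".
    out = []
    i = 0
    while i < len(chars):
        pair = "".join(chars[i:i + 2])
        if pair == "ae":
            out.append("ä")
            i += 2
        elif pair == "oe":
            out.append("ö")
            i += 2
        elif pair == "ue" and (i == 0 or chars[i - 1] != "q"):
            out.append("ü")
            i += 2
        else:
            out.append(chars[i])
            i += 1

    # Step 3: undo the protective upper-casing from step 1.
    return "".join(out).lower()


if __name__ == "__main__":
    for w in ["Koeln", "Köln", "Muenchen", "feuer", "neue", "quelle"]:
        print(w, "->", fold_umlaut_digraphs(w))
```

With this sketch "Koeln" and "Köln" fold to the same string, while
"feuer", "neue" and "quelle" are left unchanged. It is still only a
heuristic: any other word that happens to contain ae, oe or ue after a
consonant is folded as well.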

Historically, the umlaut accent stands for an omitted letter e.
e-for-umlaut tends to be used in proper names that preserve an old
spelling (Goethe etc) and nowadays of course all over the internet,
especially in domain names.

libstemmer was put together by Richard Boulton, with a suitable choice
of stemmers. I see no problem adding german2 (your point (2)) -- it is
quite small -- but equally it should not be so hard to replace
'german' with 'german2' at your end I would have thought. Meanwhile
handcoding the mappings at least solves the problem.

(Any thoughts, Richard?)

Yes, www.restaurant-kritik.de is a glorious mix! Note 'neue',
frequently used, and not to be mapped to umlaut form (like feuer).

Martin



On Wed, Jan 25, 2012 at 1:00 PM, Piotr Marciniak <mandaryn at ragnarson.com> wrote:
> Hi,
>
> The German2 variant stemmer seems to be abandoned (not compiled, so harder
> to use); however, I find it much more useful than the main German stemmer.
>
> Correct me if I am wrong, but the use of ae, oe and ue in German
> vocabulary is extensive. In fact it seems widely accepted to use the
> two-letter form wherever non-ASCII characters are forbidden.
>
> Take city names, for example: "Köln" and "Koeln", "München" and
> "Muenchen". Even the city websites "koeln.de" and "muenchen.de" use the
> second form.
>
> I would like to get the same stem for both "Köln" and "Koeln", without
> additional work :).
>
> I am not a German speaker, just working on a POI site using lots of
> different German names (cities and restaurants, see
> http://www.restaurant-kritik.de if you wish), but uniform matching of
> "ü" and "ue" and the other umlauts was a primary requirement. We use Sphinx
> with the normal German stemmer, but I manually fix the words before search
> and indexing, replacing all ue, oe, ae with u, o, a, which works but makes
> things more complex.
>
> The questions I have are:
>
> 1) Am I making sense with my assumptions about umlauts above? They are
> somewhat contrary to what is written
> here http://snowball.tartarus.org/algorithms/german2/stemmer.html but I was
> told that every word can really be written without umlauts this way.
>
> 2) Could the German variant stemmer get back into the main distribution of
> libstemmer? I tried some compiling, but my C skills are too poor to create a
> production-ready package with the correct stemmer.
>
> Piotr Marciniak


