[Snowball-discuss] German stemmer: hemmung -> hemmung, enzymhemmung -> enzymhemm
Richard R. Liu
richard.liu at pueo-owl.ch
Mon Apr 26 13:22:40 BST 2010
Having read the German stemmer rules, I understand how "hemmung" is left as is,
but "enzymhemmung" becomes "enzymhemm". However, this anomaly causes problems
in a procedure that I am adapting from English. The procedure should conflate
"hemmung von enzymen", "enzymhemmung", "enzymhemmend" by applying these steps to
each:
1. Decompound
2. Stem each morpheme
3. Discard stopwords
4. Sort the morphemes
The intended result is:
* "hemmung von enzymen" -> "hemm von enzym" -> "enzym hemm"
* "enzymhemmung" -> "enzym hemmung" -> "enzym hemm"
* "enzymhemmend" -> "enzym hemmend" -> "enzym hemm"
Unfortunately, in the last two cases, the Snowball stemmer leaves "hemmung" and
"hemmend" unchanged. Stemming before decompounding is not an option, because
the decompounder is unsupervised and knowledge-free, i.e., it derives a list of
words that occur in other words from the corpus itself, in which "hemm" does not
occur as a word.
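As a rough sketch of the four steps listed earlier, the procedure could be written as below. Everything here is a stand-in for illustration: `toy_decompound` is a hypothetical lookup table in place of the unsupervised decompounder, `toy_stem` is a naive suffix stripper that shows the stems I would like to get (it is not the Snowball German stemmer), and the stopword list contains only "von".

```python
# Sketch of the conflation procedure: decompound, stem, drop stopwords, sort.
# All helpers below are hypothetical stand-ins, not real library calls.

STOPWORDS = {"von"}  # assumption: a one-word toy stopword list

# Assumption: a fixed table standing in for the corpus-derived decompounder.
TOY_SPLITS = {
    "enzymhemmung": ["enzym", "hemmung"],
    "enzymhemmend": ["enzym", "hemmend"],
}


def toy_decompound(token):
    """Return the morphemes of a token, or the token itself if unknown."""
    return TOY_SPLITS.get(token, [token])


def toy_stem(morpheme):
    """Naive suffix stripper showing the desired stems (not Snowball)."""
    for suffix in ("ung", "end", "en"):
        if morpheme.endswith(suffix) and len(morpheme) - len(suffix) >= 3:
            return morpheme[: -len(suffix)]
    return morpheme


def conflation_key(phrase, decompound=toy_decompound, stem=toy_stem):
    """Apply steps 1-4 and return the sorted stems joined by spaces."""
    morphemes = []
    for token in phrase.lower().split():
        morphemes.extend(decompound(token))            # 1. decompound
    stems = [stem(m) for m in morphemes]               # 2. stem each morpheme
    kept = [s for s in stems if s not in STOPWORDS]    # 3. discard stopwords
    return " ".join(sorted(kept))                      # 4. sort


for phrase in ("hemmung von enzymen", "enzymhemmung", "enzymhemmend"):
    print(phrase, "->", conflation_key(phrase))
```

With these toy helpers all three inputs map to the key "enzym hemm". Passing a stemmer that leaves "hemmung" and "hemmend" untouched as the `stem` argument gives three different keys instead, which is the problem described here.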
In effect, "hemmend" and "hemmung" are left unchanged because "end" and "ung" do
not lie within R2. When either word forms the tail of a compound, R2 extends
further to the left, so "end" and "ung" now fall within R2 and are deleted.
Thus, the Porter stemmer for German is liable to produce different stems for the
same word, depending on whether the word occurs alone or in a compound.
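To make the R2 effect concrete, here is a minimal sketch of how I read the region definitions in the German stemmer description: R1 starts after the first non-vowel that follows a vowel, R2 is found the same way continuing from that point, and R1 is then moved right if needed so that at least three letters precede it. The function names are my own, and the sketch leaves out the stemmer's preprocessing (replacing "ß" and marking "u"/"y" between vowels), which does not affect these words.

```python
# Sketch of the R1/R2 computation for the German stemmer (my reading of the
# published rules; helper names are hypothetical, preprocessing is omitted).

VOWELS = set("aeiouyäöü")


def _past_vowel_then_nonvowel(word, start):
    """Index just past the first non-vowel that follows a vowel, from start."""
    i = start
    while i < len(word) and word[i] not in VOWELS:
        i += 1                      # skip to the first vowel
    while i < len(word) and word[i] in VOWELS:
        i += 1                      # skip to the first non-vowel after it
    return min(i + 1, len(word))    # region begins after that non-vowel


def regions(word):
    """Return (p1, p2): the start offsets of R1 and R2 within word."""
    n = len(word)
    if n < 3:
        return n, n                 # both regions empty for very short words
    p1 = _past_vowel_then_nonvowel(word, 0)
    p2 = _past_vowel_then_nonvowel(word, p1)
    p1 = max(p1, 3)                 # at least three letters before R1
    return p1, p2


def suffix_in_r2(word, suffix):
    """True if word ends with suffix and the suffix lies entirely in R2."""
    _, p2 = regions(word)
    return word.endswith(suffix) and len(word) - len(suffix) >= p2


for word, suffix in (("hemmung", "ung"), ("enzymhemmung", "ung"),
                     ("hemmend", "end"), ("enzymhemmend", "end")):
    _, p2 = regions(word)
    print(word, "R2 =", repr(word[p2:]), "suffix in R2:", suffix_in_r2(word, suffix))
```

For "hemmung" this gives R2 = "g", so "ung" is outside R2 and stays; for "enzymhemmung" it gives R2 = "hemmung", so "ung" is inside R2 and is removed.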
Is the consensus within the community nevertheless that the German stemmer is
correct? Has any work been done to develop a stemmer that solves the problem
described above? In "A Fast and Simple Stemming Algorithm for German Words", J.
Caumanns, Freie Universität Berlin, 1998, an alternative is described that could
deal with the problem. Does anybody know about more recent work done in this
area?
Thanks,
Richard