[Snowball-discuss] Snowball-discuss Digest, Vol 63, Issue 2

Richard R. Liu richard.liu at pueo-owl.ch
Wed Apr 28 18:55:39 BST 2010


Friedel,

Thanks for the suggestion. I've spot-checked some words in the online version of Hunspell (http://www.j3e.de/cgi-bin/spellchecker). It fails on the kind of specialized vocabulary I am dealing with. It appears to depend not only on its dictionary but also on rules that don't apply to my documents. For instance, "Enzym" and "Hemmung" are accepted, but "Enzymhemmung" is not. Yet the author of the paper that contains "Enzymhemmung", and those who read such papers, understand what is meant.

Friedel, Martin,

I have created a decompounder that builds a vocabulary from the actual text. It determines how often each word occurs in the corpus, both on its own and as a substring of other words. It then assigns each such word -- I call them "infixes" -- two values: its frequency of occurrence, and its "stickiness," the ratio of the number of times it occurs as a substring to its frequency. When decompounding a word, I build all possible decompositions, then use the log-average of the frequencies, the log-average of the stickinesses, and the ratio of the two to select the best one. The search for decompositions is iterative and proceeds from left to right, seeking infixes that match the beginning of the word, removing them, and so on. Whenever no such infix can be found, I move to the end of the (rest of the) word and look for infixes that end there. The process stops when no further infixes can be found. At that point some characters may be left over; I treat them as a yet-unseen word. Such words are assigned a frequency of 0.5 (rarer than the rarest) and a stickiness of 30 (the mean in my list of infixes, though 99% of the infixes have a lower stickiness; a stickiness of 30 means a word so sticky that it almost never occurs alone). This algorithm produces good results and is language-agnostic, so it could also be applied to, say, English chemistry texts, where such compounds are often "invented."
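
[Editor's note: a rough Python sketch of the approach described above, purely illustrative. The names, the minimum infix length of 3, the stickiness floor, and the selection rule are assumptions -- the posting does not spell out how the two log-averages and their ratio are combined -- and the fall-back search from the end of the remainder is omitted for brevity.]

import math
from collections import Counter

# Illustrative constants taken from the description above.
UNSEEN_FREQ = 0.5     # "rarer than the rarest"
UNSEEN_STICK = 30.0   # so sticky it almost never occurs alone

def infix_stats(corpus_words):
    """Frequency of every vocabulary item, plus its "stickiness":
    occurrences as a substring of other words, divided by its own
    frequency.  (Floored at 0.01 so the logs below stay finite.)"""
    freq = Counter(corpus_words)
    stats = {}
    for w in freq:
        inside = sum(f for v, f in freq.items() if v != w and w in v)
        stats[w] = (freq[w], max(inside / freq[w], 0.01))
    return stats

def decompositions(word, stats, min_len=3):
    """Enumerate candidate decompositions by matching known infixes
    at the start of the remainder, recursively.  A remainder that
    matches nothing is kept whole as a single yet-unseen part."""
    if not word:
        yield []
        return
    matched = False
    for i in range(min_len, len(word) + 1):
        head = word[:i]
        if head in stats:
            matched = True
            for tail in decompositions(word[i:], stats, min_len):
                yield [head] + tail
    if not matched:
        yield [word]

def log_avg(xs):
    """Geometric mean: the exponential of the mean of the logs."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def best_decomposition(word, stats):
    """Select among candidates using the log-average frequency and
    log-average stickiness of the parts.  The product below is just
    one plausible selection rule; the posting leaves the exact
    combination of the two log-averages and their ratio open."""
    def score(parts):
        vals = [stats.get(p, (UNSEEN_FREQ, UNSEEN_STICK)) for p in parts]
        return log_avg([f for f, _ in vals]) * log_avg([s for _, s in vals])
    return max(decompositions(word, stats), key=score)

# For example:
#   stats = infix_stats(corpus)                  # corpus = list of words
#   best_decomposition("enzymhemmung", stats)    # e.g. ['enzym', 'hemmung']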

With this I'm back to stemming. I'm investigating an algorithm proposed by Jörg Caumanns, Freie Universität Berlin, in 1998, and I am comparing its performance with that of the Porter stemmer for German. I have run both on the German words data. To measure how close the results are, I treat the output as two clusterings, one determined by the Porter stemmer and one determined by mine; in each case a cluster consists of all the words that have the same stem. I then compute the corrected Rand index of the two clusterings. The nearer the index is to 1, the more similar the clusterings. For this word list and these two stemmers the index is 0.84. That is probably not too bad, because my stemmer is supposed to conflate some words that the Porter stemmer would not. I am now assessing the two stemmers on the words on which the decompounder is based.
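
[Editor's note: for reference, the corrected Rand index (Hubert and Arabie's adjusted Rand index) is straightforward to compute from the two stem-induced labelings. The following is a self-contained Python sketch, not code from the posting; stem_a and stem_b in the usage comment are placeholders for the two stemmers.]

from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Corrected (adjusted) Rand index of two clusterings given as
    parallel lists of cluster labels -- here, the stem that each
    stemmer assigns to each word.  1.0 means identical clusterings;
    values near 0 mean agreement no better than chance."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))          # contingency table
    sum_ab = sum(comb(c, 2) for c in joint.values())  # co-clustered pairs
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)             # chance level
    max_index = (sum_a + sum_b) / 2
    return (sum_ab - expected) / (max_index - expected)

# Usage, with stem_a and stem_b standing in for the two stemmers:
#   ari = adjusted_rand_index([stem_a(w) for w in words],
#                             [stem_b(w) for w in words])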

Regards,
Richard
 
Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland

Tel.:  +41 61 331 10 47
Mobil: +41 79 708 67 66
Email:  richard.liu at pueo-owl.ch



On Apr 28, 2010, at 13:00 , snowball-discuss-request at lists.tartarus.org wrote:
> 
> Message: 2
> Date: Tue, 27 Apr 2010 20:50:30 +0200
> From: F Wolff <friedel at translate.org.za>
> Subject: Re: [Snowball-discuss] German stemmer:  hemmung -> hemmung, enzymhemmung -> enzymhemm
> To: snowball-discuss at lists.tartarus.org
> Message-ID: <1272394231.9033.3324.camel at localhost.localdomain>
> Content-Type: text/plain
> 
> On Tue, 2010-04-27 at 16:27 +0200, Martin Porter wrote:
>> Richard,
>> 
>> The stemming anomalies you note don't matter so much for general IR work,
>> but do for the work that you are doing. It seems to me that you need a
>> German word splitter, so that enzymhemmung is split to enzym+hemmung etc.
>> Lemmatization systems do this. You can find sources by typing "german
>> lemmatization" into Google. In the past I've been involved with two
>> companies that do this work, Inxight and Teragram. Since working with them,
>> both have been taken over by larger companies. Their work was proprietary,
>> with a licence fee arrangement for their use. 
>> 
>> Are there open source solutions here? I do not know. If you, or anyone else,
>> can share better information than I have it would be useful,
>> 
>> Martin
> 
> I am partly assuming that it will work, but the Hunspell spell checker
> used in OpenOffice.org, Mozilla products and elsewhere can do
> morphological analysis, which should include support for compounding
> for German. It might provide a good start.
> 
> Keep well
> Friedel
> 
> 
> --
> Recently on my blog:
> http://translate.org.za/blogs/friedel/en/content/how-should-we-do-high-contrast-application-icons
