[Snowball-discuss] German Porter Stemmer in Javascript

Joder Illi joderilli at gmail.com
Tue Jul 20 10:10:47 BST 2010


Hi Kasun:
Do you know if it will be uploaded to the Snowball project page? I would
really appreciate that. ;-)

Regards:
     Joder Illi

2010/7/17 Kasun Gajasinghe <kasunbg at gmail.com>

> Hello Joder,
>
> Thank you for the code. I really appreciate your work! I guess I will
> probably be the first one to use the German stemmer in JavaScript :)
>
> regards,
> --KasunBG
>
>
> ~~~*******'''''''''''''*******~~~
> Kasun Gajasinghe,
> University of Moratuwa,
> Sri Lanka.
> Blog: http://kasunbg.blogspot.com
> Twitter: http://twitter.com/kasunbg
>
>
> On Fri, Jul 16, 2010 at 1:44 PM, Joder Illi <joderilli at gmail.com> wrote:
>
>> Hi Kasun.
>> My German stemmer JavaScript implementation is complete: I ran the 35000
>> words of the sample vocabulary through the stemmer and got a perfect match
>> with the stemmed equivalents given on the Snowball project page. There are
>> probably some optimisations that could be made to the JavaScript, because I
>> just wanted to get it working as soon as possible and didn't spend much time
>> thinking about optimising the implementation. I expect the JavaScript
>> version of the stemmer to be compatible with the Java version, but I didn't
>> check that explicitly. Let me know if anyone optimises the implementation or
>> finds any differences with the Java implementation.
>>
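A check of this kind could be sketched as follows. This is only an
illustration, not the script used for the test described above:
compareWithExpected and its parameters are made-up names, and the sketch
assumes the sample vocabulary and the expected stems have already been loaded
into two arrays of equal length, in the same order.

```javascript
// Hypothetical helper: runs every word through stemFn and collects mismatches.
// words and expected are assumed to be parallel arrays (same length and order).
function compareWithExpected(stemFn, words, expected) {
    var mismatches = [];
    for (var i = 0; i < words.length; i++) {
        var actual = stemFn(words[i]);
        if (actual !== expected[i]) {
            mismatches.push({
                word: words[i],
                expected: expected[i],
                actual: actual
            });
        }
    }
    return mismatches;
}
```

With the stemmer from this mail, stemFn would be something like
function (w) { return new Stemmer().stemm(w); }, and an empty result array
would correspond to a perfect match.
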
>> Regarding your question about JavaScript stemmers for other European
>> languages: I remember seeing a JavaScript implementation of the English
>> stemmer while I was looking for a German one (which I didn't find, so I had
>> to implement it myself). I didn't come across JavaScript implementations
>> for other languages.
>>
>> Regards,
>>    Joder Illi
>>
>> PS: The JavaScript source follows.
>> Just include it in your page and call:
>>
>> var stemmer = new Stemmer();
>> var stemmedWord = stemmer.stemm(wordToStem);
>>
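One caveat when calling it: the regular expressions in the source below only
match lowercase letters, and uppercase U and Y are used internally as markers,
so the stemmer appears to expect lowercased input. A minimal sketch of a
wrapper for a whole text, assuming only an object with a stemm(word) method;
stemText and the splitting pattern are illustrative and not part of the posted
source:

```javascript
// Hypothetical wrapper: lowercases the text, splits it into words and
// stems each word with the given stemmer object.
function stemText(stemmer, text) {
    // Anything that is not a lowercase German letter separates two words.
    var words = text.toLowerCase().split(/[^a-zäöüß]+/);
    var stems = [];
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== '') { // split() can yield empty strings at the edges
            stems.push(stemmer.stemm(words[i]));
        }
    }
    return stems;
}
```

It would be called as stemText(new Stemmer(), someText) and returns an array
with one stem per word.
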
>>
>> function Stemmer() {
>>     /*
>>     German includes the following accented forms,
>>     ä   ö   ü
>>     and a special letter, ß, equivalent to double s.
>>     The following letters are vowels:
>>     a   e   i   o   u   y   ä   ö   ü
>>     */
>>
>>     this.stemm = function(word) {
>>         /*
>>         Put u and y between vowels into upper case
>>         */
>>         word = word.replace(/([aeiouyäöü])u([aeiouyäöü])/g, '$1U$2');
>>         word = word.replace(/([aeiouyäöü])y([aeiouyäöü])/g, '$1Y$2');
>>
>>         /*
>>         and then do the following mappings,
>>         (a) replace ß with ss,
>>         (b) replace ae with ä,
>>         (c) replace oe with ö,
>>         (d) replace ue with ü unless preceded by q.
>>         Mappings (b) to (d) are not done here because they cause trouble
>>         with diphthongs.
>>         So in quelle, ue is not mapped to ü because it follows q, and in
>>         feuer it is not mapped because the first part of the rule changes
>>         it to feUer, so the u is not found.
>>         */
>>         word = word.replace(/ß/g, 'ss');
>>         //word = word.replace(/ae/g, 'ä');
>>         //word = word.replace(/oe/g, 'ö');
>>         //word = word.replace(/([^q])ue/g, '$1ü');
>>
>>         /*
>>         R1 and R2 are first set up in the standard way (see the note on R1
>>         and R2), but then R1 is adjusted so that the region before it
>>         contains at least 3 letters.
>>         R1 is the region after the first non-vowel following a vowel, or is
>>         the null region at the end of the word if there is no such
>>         non-vowel.
>>         R2 is the region after the first non-vowel following a vowel in R1,
>>         or is the null region at the end of the word if there is no such
>>         non-vowel.
>>         */
>>
>>         var r1Index = word.search(/[aeiouyäöü][^aeiouyäöü]/);
>>         var r1 = '';
>>         if (r1Index != -1) {
>>             r1Index += 2;
>>             r1 = word.substring(r1Index);
>>         }
>>
>>         var r2Index = -1;
>>         var r2 = '';
>>
>>         if (r1Index != -1) {
>>             r2Index = r1.search(/[aeiouyäöü][^aeiouyäöü]/);
>>             if (r2Index != -1) {
>>                 r2Index += 2;
>>                 r2 = r1.substring(r2Index);
>>                 r2Index += r1Index;
>>             } else {
>>                 r2 = '';
>>             }
>>         }
>>
>>         if (r1Index != -1 && r1Index < 3) {
>>             r1Index = 3;
>>             r1 = word.substring(r1Index);
>>         }
>>
>>         /*
>>         Define a valid s-ending as one of b, d, f, g, h, k, l, m, n, r or t.
>>         Define a valid st-ending as the same list, excluding letter r.
>>         */
>>
>>         /*
>>         Do each of steps 1, 2 and 3.
>>         */
>>
>>         /*
>>         Step 1:
>>         Search for the longest among the following suffixes,
>>         (a) em   ern   er
>>         (b) e   en   es
>>         (c) s (preceded by a valid s-ending)
>>         */
>>         var a1Index = word.search(/(em|ern|er)$/g);
>>         var b1Index = word.search(/(e|en|es)$/g);
>>         var c1Index = word.search(/([bdfghklmnrt]s)$/g);
>>         if (c1Index != -1) {
>>             c1Index++;
>>         }
>>         var index1 = 10000;
>>         var optionUsed1 = '';
>>         if (a1Index != -1 && a1Index < index1) {
>>             optionUsed1 = 'a';
>>             index1 = a1Index;
>>         }
>>         if (b1Index != -1 && b1Index < index1) {
>>             optionUsed1 = 'b';
>>             index1 = b1Index;
>>         }
>>         if (c1Index != -1 && c1Index < index1) {
>>             optionUsed1 = 'c';
>>             index1 = c1Index;
>>         }
>>
>>         /*
>>         and delete if in R1. (Of course the letter of the valid s-ending is
>>         not necessarily in R1.) If an ending of group (b) is deleted, and
>>         the ending is preceded by niss, delete the final s.
>>         (For example, äckern -> äck, ackers -> acker, armes -> arm,
>>         bedürfnissen -> bedürfnis)
>>         */
>>
>>         if (index1 != 10000 && r1Index != -1) {
>>             if (index1 >= r1Index) {
>>                 word = word.substring(0, index1);
>>                 if (optionUsed1 == 'b') {
>>                     if (word.search(/niss$/) != -1) {
>>                         word = word.substring(0, word.length - 1);
>>                     }
>>                 }
>>             }
>>         }
>>         /*
>>         Step 2:
>>         Search for the longest among the following suffixes,
>>         (a) en   er   est
>>         (b) st (preceded by a valid st-ending, itself preceded by at least
>>         3 letters)
>>         */
>>
>>         var a2Index = word.search(/(en|er|est)$/g);
>>         var b2Index = word.search(/(.{3}[bdfghklmnt]st)$/g);
>>         if (b2Index != -1) {
>>             b2Index += 4;
>>         }
>>
>>         var index2 = 10000;
>>         var optionUsed2 = '';
>>         if (a2Index != -1 && a2Index < index2) {
>>             optionUsed2 = 'a';
>>             index2 = a2Index;
>>         }
>>         if (b2Index != -1 && b2Index < index2) {
>>             optionUsed2 = 'b';
>>             index2 = b2Index;
>>         }
>>
>>         /*
>>         and delete if in R1.
>>         (For example, derbsten -> derbst by step 1, and derbst -> derb by
>>         step 2, since b is a valid st-ending, and is preceded by just 3
>>         letters)
>>         */
>>
>>         if (index2 != 10000 && r1Index != -1) {
>>             if (index2 >= r1Index) {
>>                 word = word.substring(0, index2);
>>             }
>>         }
>>
>>         /*
>>         Step 3: d-suffixes (*)
>>         Search for the longest among the following suffixes, and perform
>>         the action indicated.
>>         end   ung
>>             delete if in R2
>>             if preceded by ig, delete if in R2 and not preceded by e
>>         ig   ik   isch
>>             delete if in R2 and not preceded by e
>>         lich   heit
>>             delete if in R2
>>             if preceded by er or en, delete if in R1
>>         keit
>>             delete if in R2
>>             if preceded by lich or ig, delete if in R2
>>         */
>>
>>         var a3Index = word.search(/(end|ung)$/g);
>>         var b3Index = word.search(/[^e](ig|ik|isch)$/g);
>>         var c3Index = word.search(/(lich|heit)$/g);
>>         var d3Index = word.search(/(keit)$/g);
>>         if (b3Index != -1) {
>>             b3Index++;
>>         }
>>
>>         var index3 = 10000;
>>         var optionUsed3 = '';
>>         if (a3Index != -1 && a3Index < index3) {
>>             optionUsed3 = 'a';
>>             index3 = a3Index;
>>         }
>>         if (b3Index != -1 && b3Index < index3) {
>>             optionUsed3 = 'b';
>>             index3 = b3Index;
>>         }
>>         if (c3Index != -1 && c3Index < index3) {
>>             optionUsed3 = 'c';
>>             index3 = c3Index;
>>         }
>>         if (d3Index != -1 && d3Index < index3) {
>>             optionUsed3 = 'd';
>>             index3 = d3Index;
>>         }
>>
>>         if (index3 != 10000 && r2Index != -1) {
>>             if (index3 >= r2Index) {
>>                 word = word.substring(0, index3);
>>                 var optionIndex = -1;
>>                 if (optionUsed3 == 'a') {
>>                     optionIndex = word.search(/[^e](ig)$/);
>>                     if (optionIndex != -1) {
>>                         optionIndex++;
>>                         if (optionIndex >= r2Index) {
>>                             word = word.substring(0, optionIndex);
>>                         }
>>                     }
>>                 } else if (optionUsed3 == 'c') {
>>                     optionIndex = word.search(/(er|en)$/);
>>                     if (optionIndex != -1) {
>>                         if (optionIndex >= r1Index) {
>>                             word = word.substring(0, optionIndex);
>>                         }
>>                     }
>>                 } else if (optionUsed3 == 'd') {
>>                     optionIndex = word.search(/(lich|ig)$/);
>>                     if (optionIndex != -1) {
>>                         if (optionIndex >= r2Index) {
>>                             word = word.substring(0, optionIndex);
>>                         }
>>                     }
>>                 }
>>             }
>>         }
>>
>>         /*
>>         Finally,
>>         turn U and Y back into lower case, and remove the umlaut accent
>>         from a, o and u.
>>         */
>>         word = word.replace(/U/g, 'u');
>>         word = word.replace(/Y/g, 'y');
>>         word = word.replace(/ä/g, 'a');
>>         word = word.replace(/ö/g, 'o');
>>         word = word.replace(/ü/g, 'u');
>>
>>         return word;
>>     };
>> }
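
The first and last transformations of the function above can be checked on
their own. A small sketch using the same regular expressions; the names
prelude and postlude are only illustrative and do not appear in the posted
source:

```javascript
// Before stemming: put u and y between vowels into upper case, so that the
// lowercase vowel patterns treat them as non-vowels, and map ß to ss.
function prelude(word) {
    return word
        .replace(/([aeiouyäöü])u([aeiouyäöü])/g, '$1U$2')
        .replace(/([aeiouyäöü])y([aeiouyäöü])/g, '$1Y$2')
        .replace(/ß/g, 'ss');
}

// After stemming: turn the markers back into lower case and remove the
// umlaut accent from a, o and u.
function postlude(word) {
    return word
        .replace(/U/g, 'u')
        .replace(/Y/g, 'y')
        .replace(/ä/g, 'a')
        .replace(/ö/g, 'o')
        .replace(/ü/g, 'u');
}
```

For instance, prelude('feuer') gives 'feUer' while prelude('quelle') leaves the
word unchanged, because q is not a vowel; postlude('feUer') gives 'feuer' again.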
>>
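The region setup and Step 1 can also be tried in isolation. The following is a
sketch that mirrors the logic of the function above rather than replacing it;
computeRegions and step1 are illustrative names that are not part of the posted
source, and -1 stands for "no such region", as in the original.

```javascript
var VOWEL_THEN_NON_VOWEL = /[aeiouyäöü][^aeiouyäöü]/;

// Returns the start indexes of R1 and R2 in word (-1 if there is no region).
function computeRegions(word) {
    var r1Index = word.search(VOWEL_THEN_NON_VOWEL);
    var r2Index = -1;
    if (r1Index !== -1) {
        r1Index += 2; // R1 starts after the vowel and the non-vowel
        var offset = word.substring(r1Index).search(VOWEL_THEN_NON_VOWEL);
        if (offset !== -1) {
            r2Index = r1Index + offset + 2; // same rule, applied inside R1
        }
        if (r1Index < 3) {
            r1Index = 3; // at least 3 letters before R1
        }
    }
    return { r1Index: r1Index, r2Index: r2Index };
}

// Step 1 only: remove the longest matching suffix if it starts inside R1.
function step1(word, r1Index) {
    var candidates = [
        { group: 'a', index: word.search(/(em|ern|er)$/) },
        { group: 'b', index: word.search(/(e|en|es)$/) },
        { group: 'c', index: word.search(/[bdfghklmnrt]s$/) }
    ];
    if (candidates[2].index !== -1) {
        candidates[2].index += 1; // keep the s-ending letter, drop only the s
    }
    // All patterns are anchored at the end of the word, so the match with
    // the smallest start index is the longest suffix.
    var best = null;
    for (var i = 0; i < candidates.length; i++) {
        var c = candidates[i];
        if (c.index !== -1 && (best === null || c.index < best.index)) {
            best = c;
        }
    }
    if (best === null || r1Index === -1 || best.index < r1Index) {
        return word; // nothing found, or the suffix is not inside R1
    }
    word = word.substring(0, best.index);
    if (best.group === 'b' && /niss$/.test(word)) {
        word = word.substring(0, word.length - 1); // niss -> nis
    }
    return word;
}
```

With the examples quoted in the comments, step1('äckern',
computeRegions('äckern').r1Index) gives 'äck', and the same call for
'bedürfnissen' gives 'bedürfnis'.
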
>> 2010/7/15 Kasun Gajasinghe <kasunbg at gmail.com>
>>
>>
>>>
>>> On Thu, Jul 15, 2010 at 9:18 PM, Joder Illi <joderilli at gmail.com> wrote:
>>>
>>>> Hi there, I just made a Javascript implementation of the German Porter
>>>> Stemmer (the first variant). If you are interested, let me know and I will
>>>> provide the sources for you to review and publish.
>>>
>>>
>>> Hello Joder,
>>> It's very nice to hear that you implemented the German stemmer in
>>> JavaScript. I am very much willing to use it, as I have been looking for
>>> European stemmers written in JavaScript for quite some time.
>>>
>>> If you can provide the source that would be great, and please specify the
>>> level of implementation, i.e. I would like to know whether the stemmer is
>>> fully implemented or whether there is some work left to be done. As I will
>>> be using it along with the Java version, the stemmer should be compatible
>>> with the Java version.
>>>
>>> Your work is appreciated!
>>>
>>> BTW, does anyone know whether there are stemmers for other European
>>> languages written in JavaScript as well?
>>>
>>> regards,
>>> --KasunBG
>>>
>>>
>>>> Greetings:
>>>>        Joder Illi
>>>>
>>>> _______________________________________________
>>>> Snowball-discuss mailing list
>>>> Snowball-discuss at lists.tartarus.org
>>>> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>>>>
>>>>
>>
>>
>