[Snowball-discuss] German Porter Stemmer in Javascript
Kasun Gajasinghe
kasunbg at gmail.com
Sat Jul 17 08:21:27 BST 2010
Hello Joder,
Thank you for the code. I really appreciate your work! I guess I will probably
be the first one to use the German stemmer in JavaScript :)
regards,
--KasunBG
~~~*******'''''''''''''*******~~~
Kasun Gajasinghe,
University of Moratuwa,
Sri Lanka.
Blog: http://kasunbg.blogspot.com
Twitter: http://twitter.com/kasunbg
On Fri, Jul 16, 2010 at 1:44 PM, Joder Illi <joderilli at gmail.com> wrote:
> Hi Kasun.
> My German stemmer JavaScript implementation is complete: I ran the 35000
> words of the sample vocabulary through the stemmer and got a perfect match
> with the stemmed equivalents given on the Snowball project page. I suppose
> some optimisations could still be made to the JavaScript, because I just
> wanted to get it working as soon as possible and didn't spend much time
> thinking about optimizing the implementation. I suppose that the JavaScript
> version of the stemmer should be compatible with the Java version, but I
> didn't check it explicitly. Let me know if anyone optimizes the
> implementation or if there are any differences from the Java implementation.
>
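A check like the one described above could be scripted roughly as follows. This is only a sketch: findMismatches, toyStem and the sample arrays are hypothetical names made up for illustration, and the toy stemmer merely stands in for a real stemm() call so that the snippet runs on its own. With the real data, the two arrays would hold the sample vocabulary and the expected stems from the Snowball page, line by line.

```javascript
// Hypothetical helper: runs every word through stemFn and collects the
// entries whose result differs from the expected stem.
function findMismatches(stemFn, words, expected) {
  var mismatches = [];
  for (var i = 0; i < words.length; i++) {
    var actual = stemFn(words[i]);
    if (actual !== expected[i]) {
      mismatches.push({ word: words[i], expected: expected[i], actual: actual });
    }
  }
  return mismatches;
}

// Toy stand-in for a real stemmer (assumption, NOT the German algorithm):
// it only strips a trailing "en".
function toyStem(word) {
  return word.replace(/en$/, '');
}

// Made-up sample data for the toy stemmer.
var sampleWords = ['laufen', 'haus', 'kaufen'];
var sampleStems = ['lauf', 'haus', 'kauf'];

// An empty array means every word matched its expected stem.
var result = findMismatches(toyStem, sampleWords, sampleStems);
```

An empty result corresponds to the "perfect match" mentioned above; any entries left in the array show the word, the expected stem and what the stemmer actually returned.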
> Regarding your question about JavaScript stemmers for other European
> languages: I remember seeing a JavaScript implementation of the English
> stemmer while I was looking for the German one (which I didn't find, so I
> had to implement it myself). I didn't come across JavaScript
> implementations for other languages.
>
> Regards,
> Joder Illi
>
> PS: Javascript source follows
> Just include it in your page and call
>
> var stemmer = new Stemmer();
> var stemmedWord = stemmer.stemm(wordToStem);
>
>
> function Stemmer() {
> /*
> German includes the following accented forms,
> ä ö ü
> and a special letter, ß, equivalent to double s.
> The following letters are vowels:
> a e i o u y ä ö ü
> */
>
> this.stemm = function(word) {
> /*
> Put u and y between vowels into upper case
> */
> word = word.replace(/([aeiouyäöü])u([aeiouyäöü])/g, '$1U$2');
> word = word.replace(/([aeiouyäöü])y([aeiouyäöü])/g, '$1Y$2');
>
> /*
> and then do the following mappings,
> (a) replace ß with ss,
> (b) replace ae with ä,
> (c) replace oe with ö,
> (d) replace ue with ü unless preceded by q.
> Mappings (b) to (d) are not applied here because they cause trouble
> with diphthongs.
> So in quelle, ue is not mapped to ü because it follows q, and in
> feuer it is not mapped because the first part of the rule changes it to
> feUer, so the u is not found.
> */
> word = word.replace(/ß/g, 'ss');
> //word = word.replace(/ae/g, 'ä');
> //word = word.replace(/oe/g, 'ö');
> //word = word.replace(/([^q])ue/g, '$1ü');
>
> /*
> R1 and R2 are first set up in the standard way (see the note on R1
> and R2), but then R1 is adjusted so that the region before it contains at
> least 3 letters.
> R1 is the region after the first non-vowel following a vowel, or is
> the null region at the end of the word if there is no such non-vowel.
> R2 is the region after the first non-vowel following a vowel in R1,
> or is the null region at the end of the word if there is no such non-vowel.
> */
>
> var r1Index = word.search(/[aeiouyäöü][^aeiouyäöü]/);
> var r1 = '';
> if (r1Index != -1) {
> r1Index += 2;
> r1 = word.substring(r1Index);
> }
>
> var r2Index = -1;
> var r2 = '';
>
> if (r1Index != -1) {
> r2Index = r1.search(/[aeiouyäöü][^aeiouyäöü]/);
> if (r2Index != -1) {
> r2Index += 2;
> r2 = r1.substring(r2Index);
> r2Index += r1Index;
> } else {
> r2 = '';
> }
> }
>
> if (r1Index != -1 && r1Index < 3) {
> r1Index = 3;
> r1 = word.substring(r1Index);
> }
>
> /*
> Define a valid s-ending as one of b, d, f, g, h, k, l, m, n, r or t.
> Define a valid st-ending as the same list, excluding letter r.
> */
>
> /*
> Do each of steps 1, 2 and 3.
> */
>
> /*
> Step 1:
> Search for the longest among the following suffixes,
> (a) em ern er
> (b) e en es
> (c) s (preceded by a valid s-ending)
> */
> var a1Index = word.search(/(em|ern|er)$/g);
> var b1Index = word.search(/(e|en|es)$/g);
> var c1Index = word.search(/([bdfghklmnrt]s)$/g);
> if (c1Index != -1) {
> c1Index++;
> }
> var index1 = 10000;
> var optionUsed1 = '';
> if (a1Index != -1 && a1Index < index1) {
> optionUsed1 = 'a';
> index1 = a1Index;
> }
> if (b1Index != -1 && b1Index < index1) {
> optionUsed1 = 'b';
> index1 = b1Index;
> }
> if (c1Index != -1 && c1Index < index1) {
> optionUsed1 = 'c';
> index1 = c1Index;
> }
>
> /*
> and delete if in R1. (Of course the letter of the valid s-ending is
> not necessarily in R1.) If an ending of group (b) is deleted, and the ending
> is preceded by niss, delete the final s.
> (For example, äckern -> äck, ackers -> acker, armes -> arm,
> bedürfnissen -> bedürfnis)
> */
>
> if (index1 != 10000 && r1Index != -1) {
> if (index1 >= r1Index) {
> word = word.substring(0, index1);
> if (optionUsed1 == 'b') {
> if (word.search(/niss$/) != -1) {
> word = word.substring(0, word.length -1);
> }
> }
> }
> }
> /*
> Step 2:
> Search for the longest among the following suffixes,
> (a) en er est
> (b) st (preceded by a valid st-ending, itself preceded by at least
> 3 letters)
> */
>
> var a2Index = word.search(/(en|er|est)$/g);
> var b2Index = word.search(/(.{3}[bdfghklmnt]st)$/g);
> if (b2Index != -1) {
> b2Index += 4;
> }
>
> var index2 = 10000;
> var optionUsed2 = '';
> if (a2Index != -1 && a2Index < index2) {
> optionUsed2 = 'a';
> index2 = a2Index;
> }
> if (b2Index != -1 && b2Index < index2) {
> optionUsed2 = 'b';
> index2 = b2Index;
> }
>
> /*
> and delete if in R1.
> (For example, derbsten -> derbst by step 1, and derbst -> derb by
> step 2, since b is a valid st-ending, and is preceded by just 3 letters)
> */
>
> if (index2 != 10000 && r1Index != -1) {
> if (index2 >= r1Index) {
> word = word.substring(0, index2);
> }
> }
>
> /*
> Step 3: d-suffixes (*)
> Search for the longest among the following suffixes, and perform
> the action indicated.
>   end ung
>     delete if in R2
>     if preceded by ig, delete if in R2 and not preceded by e
>   ig ik isch
>     delete if in R2 and not preceded by e
>   lich heit
>     delete if in R2
>     if preceded by er or en, delete if in R1
>   keit
>     delete if in R2
>     if preceded by lich or ig, delete if in R2
> */
>
> var a3Index = word.search(/(end|ung)$/g);
> var b3Index = word.search(/[^e](ig|ik|isch)$/g);
> var c3Index = word.search(/(lich|heit)$/g);
> var d3Index = word.search(/(keit)$/g);
> if (b3Index != -1) {
> b3Index ++;
> }
>
> var index3 = 10000;
> var optionUsed3 = '';
> if (a3Index != -1 && a3Index < index3) {
> optionUsed3 = 'a';
> index3 = a3Index;
> }
> if (b3Index != -1 && b3Index < index3) {
> optionUsed3 = 'b';
> index3 = b3Index;
> }
> if (c3Index != -1 && c3Index < index3) {
> optionUsed3 = 'c';
> index3 = c3Index;
> }
> if (d3Index != -1 && d3Index < index3) {
> optionUsed3 = 'd';
> index3 = d3Index;
> }
>
> if (index3 != 10000 && r2Index != -1) {
> if (index3 >= r2Index) {
> word = word.substring(0, index3);
> var optionIndex = -1;
> if (optionUsed3 == 'a') {
> optionIndex = word.search(/[^e](ig)$/);
> if (optionIndex != -1) {
> optionIndex++;
> if (optionIndex >= r2Index) {
> word = word.substring(0, optionIndex);
> }
> }
> } else if (optionUsed3 == 'c') {
> optionIndex = word.search(/(er|en)$/);
> if (optionIndex != -1) {
> if (optionIndex >= r1Index) {
> word = word.substring(0, optionIndex);
> }
> }
> } else if (optionUsed3 == 'd') {
> optionIndex = word.search(/(lich|ig)$/);
> if (optionIndex != -1) {
> if (optionIndex >= r2Index) {
> word = word.substring(0, optionIndex);
> }
> }
> }
> }
> }
>
> /*
> Finally,
> turn U and Y back into lower case, and remove the umlaut accent
> from a, o and u.
> */
> word = word.replace(/U/g, 'u');
> word = word.replace(/Y/g, 'y');
> word = word.replace(/ä/g, 'a');
> word = word.replace(/ö/g, 'o');
> word = word.replace(/ü/g, 'u');
>
> return word;
> };
> }
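To make the R1/R2 comments in the code above easier to follow, here is a small standalone sketch of the region computation. It is not a drop-in replacement for the posted code: findRegions is a hypothetical helper name, and it returns the start offsets of R1 and R2 rather than the substrings. It assumes the word has already been lower-cased and had u/y between vowels marked as U/Y, as in the posted code.

```javascript
// A vowel immediately followed by a non-vowel.
var VOWEL_THEN_NON_VOWEL = /[aeiouyäöü][^aeiouyäöü]/;

// Hypothetical helper: returns the start offsets of R1 and R2.
// An offset equal to word.length means the region is empty.
function findRegions(word) {
  // R1 starts after the first non-vowel that follows a vowel.
  var r1Start = word.length;
  var m1 = word.search(VOWEL_THEN_NON_VOWEL);
  if (m1 !== -1) {
    r1Start = m1 + 2;
  }

  // R2 applies the same rule inside R1, using R1 before the adjustment below.
  var r2Start = word.length;
  var m2 = word.substring(r1Start).search(VOWEL_THEN_NON_VOWEL);
  if (m2 !== -1) {
    r2Start = r1Start + m2 + 2;
  }

  // German adjustment: at least 3 letters must precede R1.
  // Clamping to word.length is a choice made for this sketch.
  if (r1Start < 3) {
    r1Start = Math.min(3, word.length);
  }

  return { r1Start: r1Start, r2Start: r2Start };
}
```

For "derbsten" this gives r1Start 3 and r2Start 8, so R1 is "bsten" and R2 is empty, which is why step 3 never fires for that word.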
>
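The "search for the longest suffix, delete if in R1" pattern used in steps 1 and 2 could also be written without the regex index bookkeeping. The following is a sketch under my own assumptions: longestSuffix and deleteIfInRegion are hypothetical helper names, and the preceding-letter conditions (valid s-ending, niss) are deliberately left out.

```javascript
// Hypothetical helper: returns the longest entry of `suffixes` that
// `word` ends with, or '' if none matches.
function longestSuffix(word, suffixes) {
  var best = '';
  for (var i = 0; i < suffixes.length; i++) {
    var s = suffixes[i];
    var endsWith = word.length >= s.length &&
        word.lastIndexOf(s) === word.length - s.length;
    if (endsWith && s.length > best.length) {
      best = s;
    }
  }
  return best;
}

// Hypothetical helper: removes `suffix` only if it starts at or after
// `regionStart`, i.e. lies entirely inside the region.
function deleteIfInRegion(word, suffix, regionStart) {
  var start = word.length - suffix.length;
  if (suffix !== '' && start >= regionStart) {
    return word.substring(0, start);
  }
  return word;
}

// Step 1 suffix groups (a) and (b) from the comments in the posted code.
var STEP1_SUFFIXES = ['em', 'ern', 'er', 'e', 'en', 'es'];

// R1 starts at offset 3 for "äckern" after the 3-letter adjustment.
var stemmedExample = deleteIfInRegion(
    'äckern', longestSuffix('äckern', STEP1_SUFFIXES), 3);
```

Here stemmedExample ends up as "äck", matching the äckern -> äck example in the step 1 comment. As in the posted code, if the longest suffix is not inside the region, nothing is removed and no shorter suffix is tried.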
> 2010/7/15 Kasun Gajasinghe <kasunbg at gmail.com>
>
>
>>
>> On Thu, Jul 15, 2010 at 9:18 PM, Joder Illi <joderilli at gmail.com> wrote:
>>
>>> Hi there, I just made a Javascript implementation of the German Porter
>>> Stemmer (the first variant). If you are interested, let me know and I will
>>> provide the sources for you to review and publish.
>>
>>
>> Hello Joder,
>> It's very nice to hear that you implemented the German stemmer in
>> JavaScript. I am very much willing to use it, as I have been looking for
>> European stemmers written in JavaScript for quite some time.
>>
>> If you can provide the source, that would be great. Please also specify
>> the level of implementation, i.e. whether the stemmer is fully implemented
>> or there is some work left to be done. As I will be using it alongside the
>> Java version, the stemmer should be compatible with the Java version.
>>
>> Your work is appreciated!
>>
>> BTW, does anyone know whether there are stemmers written for other
>> European languages (in JavaScript) as well?
>>
>> regards,
>> --KasunBG
>>
>>
>>> Greetings:
>>> Joder Illi
>>>
>
>