[Snowball-discuss] The Norwegian stemmer algorithm

Martin Porter martin_porter@softhome.net
Thu, 29 Nov 2001 01:30:17 -0700


Ask,

>But as far as I can tell, this algorithm already takes a lot of nynorsk,
>because -ar, -ande, -ast, -ane, -eleg, -eig and -leg is not "bokmål" but
>nynorsk.

I developed the algorithm with a particular vocabulary which I put together
myself from texts downloaded from the Web. I had assumed that the texts were
entirely bokmål Norwegian, but I must have been in error here. I am quite
willing to redo the work if you can guide me to texts in nynorsk and bokmål
separately - you need about 4-5 megabytes of a language as a sample, and the
texts should be as plain as possible as far as mark-up goes, and
representative of the contemporary language. If on the other hand the simple
Norwegian stemmer I've presented works equally well on bokmål and nynorsk, so
much the better. I suppose in setting up IR systems of Norwegian text it must
be an inconvenience needing to separate the two dialects.

Martin



