[Snowball-discuss] RE: Snowball-discuss digest, Vol 1 #5 - 1 msg
Oleg Bartunov
oleg@sai.msu.su
Mon Sep 9 10:46:01 2002
Svetlana, you're my hero! I'd like to test the Russian stemmer
(the current version is available at http://intra.astronet.ru/db/lingua/snowball/).
I'm working on the development of the Russian Scientific Network (www.nature.ru),
and we have found that stemming is very important for scientific corpora.
Oleg
On Mon, 9 Sep 2002, Svetlana Pereyaslavets wrote:
> Dear Martin,
> I am not a linguist, but I am a native Russian speaker. May I try to give some
> explanation of this suffix?
> Free, but hopefully helpful :-)
> It is very common in Russian adjectives and adverbs that follow this
> construction:
>
> *****basic*construction********
> prefix-root - "other optional" suffix - n (Oleg's question) - <adjective
> ending> ( = yi/iy/oy.... through all genders and declensions)
> *******************************
> The rule has the following options:
> 1. prefix-root - n - <adjective ending >
>
> 1.1. The root itself ends in -n-
> In this case we will encounter -nn- after stripping the adjective ending,
> and we SHOULD REMOVE one -n- (that is the suffix).
> Such words usually don't have prefixes (so they can easily be checked against
> a dictionary).
> Example : kon-n-yi (adjective from "kon'"=horse)
>
> 1.2 The root ends in any other letter
>
> we SHOULD REMOVE the -n- (that is the suffix).
> Example: ruch-n-oy (adjective from "ruka"= hand).
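
Rule 1 above can be sketched in Python as follows. This is only an illustration under stated assumptions: the words are written in the same Latin transliteration used in this thread, the list of adjective endings is a small hypothetical sample (the post names only yi/iy/oy, and Oleg's question adds ogo/omu), and the function names are invented for the example. It is not the Snowball Russian stemmer.

```python
# Hypothetical, incomplete sample of transliterated adjective endings.
ADJECTIVE_ENDINGS = ("ogo", "omu", "yi", "iy", "oy")


def strip_adjective_ending(word):
    """Return the word without its adjective ending, or None if none matches."""
    for ending in ADJECTIVE_ENDINGS:
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)]
    return None


def remove_n_suffix(word):
    """Rule 1: strip the adjective ending, then drop one trailing n."""
    base = strip_adjective_ending(word)
    if base is None or not base.endswith("n"):
        return word  # not an adjective of this shape; leave it untouched
    # Case 1.1 (base ends in "nn") and case 1.2 (base ends in a single "n")
    # both come down to removing exactly one n, the suffix.
    return base[:-1]


print(remove_n_suffix("konnyi"))   # kon
print(remove_n_suffix("ruchnoy"))  # ruch
```

Because both sub-cases remove a single n, the sketch does not need to know whether the root itself ends in n. It would also strip an n that belongs to the root, which is why the dictionary check mentioned in 1.1 still matters.
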
>
> 2. prefix-root - "other optional"suffix - n - adjective ending
>
>
> 2.1. other optional suffix = - an- or - yan -
> - a- or -ya- SHOULD BE REMOVED TOGETHER with the suffix -n-.
>
> Example: "sherst-yan-oy" (=woolen).
>
> THREE exceptions to this rule would fall under case 2.2:
> "stekl-yan -n - <adjective ending>" (adj from glass)
> "olov-yan -n - <adjective ending>" (adj from tin)
> "derev-yan -n - <adjective ending>" (adj from wood)
>
> 2.2. other optional suffix = -on - or -en-
>
> REMOVE -n- and the adjacent -en- or -on-.
>
> Example: "osob - en- n- <adjective ending>" (=special)
>
> ONE exception to this rule would fall under case 2.1:
> "ran -en- <adjective ending>" (= injured)
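
Cases 2.1 and 2.2, with their listed exceptions, could be sketched in Python like this. The assumptions are the same as for any quick experiment on this thread: Latin transliteration, a small hypothetical sample of adjective endings, and invented function and table names. The exception tables hold only the four words named above.

```python
# Hypothetical, incomplete sample of transliterated adjective endings.
ADJECTIVE_ENDINGS = ("ogo", "omu", "yi", "iy", "oy")

# The three -yan-n- exceptions to 2.1 (handled like 2.2) and the one
# -en- exception to 2.2 (handled like 2.1), keyed by the ending-stripped form.
DOUBLE_N_EXCEPTIONS = {"steklyann": "stekl", "olovyann": "olov", "derevyann": "derev"}
SINGLE_N_EXCEPTIONS = {"ranen": "ran"}


def strip_adjective_ending(word):
    """Return the word without its adjective ending, or None if none matches."""
    for ending in ADJECTIVE_ENDINGS:
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)]
    return None


def remove_compound_n_suffix(word):
    """Apply cases 2.1 and 2.2; return the word unchanged if neither fits."""
    base = strip_adjective_ending(word)
    if base is None:
        return word
    if base in DOUBLE_N_EXCEPTIONS:
        return DOUBLE_N_EXCEPTIONS[base]
    if base in SINGLE_N_EXCEPTIONS:
        return SINGLE_N_EXCEPTIONS[base]
    # Case 2.2: -en-n- or -on-n- is removed as a whole.
    for suffix in ("enn", "onn"):
        if base.endswith(suffix) and len(base) > len(suffix):
            return base[: -len(suffix)]
    # Case 2.1: -yan- or -an- is removed as a whole ("yan" is tried first).
    for suffix in ("yan", "an"):
        if base.endswith(suffix) and len(base) > len(suffix):
            return base[: -len(suffix)]
    return word


print(remove_compound_n_suffix("sherstyanoy"))  # sherst
print(remove_compound_n_suffix("osobennyi"))    # osob
```

Keeping the exceptions in lookup tables means the general rules stay as narrow as the description above, at the cost of needing a word list for anything the post does not mention.
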
>
> 2.3. HARD CASE (RUSSIAN LEXICAL DIVERSITY IS INVOLVED) - I can't suggest a
> solution right now, as I need time to think how to detect that without
> knowledge of the natural language:
>
> other optional suffix = -in
>
> 2.3.1. If the following substitution is valid, refer to 1.1. or 2.1
> (i.e. -n- SHOULD BE REMOVED; the adjacent -in- suffix MAY and probably SHOULD
> be removed as well, depending on the level of detail required)
>
> - a (noun)
> /
> root - in -|
> \
> -n- <adjective ending>
>
>
> Example: "star-in-a"-"star-in-n -yi" (= old)
>
> 2.3.2 If the substitution above is not valid, refer to 1.2. or 2.2., with
> the same reservation.
>
> Example: "mysh-in - <adjective ending> " (adjective from "mysh'"= mouse)
>
> 3. PARTICIPLE II may look the same as an adjective to an end-stripping
> stemmer.
> In Participles II, the scheme is:
>
> word - {-on, -en, -an, -yan} - n - <adjective ending>
>
> Where "word" is VERY LIKELY to consist of "prefix-root" (i.e. there is a
> high probability that participle II would have a prefix).
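
The participle scheme above is easy to express as a pattern, which also shows why it is hard to tell apart from the adjective cases. The following Python sketch assumes the same Latin transliteration and a small hypothetical sample of endings; the pattern and function names are invented for the example.

```python
import re

# word - {on, en, an, yan} - n - <adjective ending>
# The ending alternatives are a hypothetical, incomplete sample.
PARTICIPLE_SHAPE = re.compile(
    r"^(?P<word>.+?)(?P<suffix>yan|on|en|an)n(?P<ending>ogo|omu|yi|iy|oy)$"
)


def split_participle_shape(word):
    """Return (word, suffix, ending) if the form fits the scheme, else None."""
    match = PARTICIPLE_SHAPE.match(word)
    if match is None:
        return None
    return match.group("word"), match.group("suffix"), match.group("ending")


print(split_participle_shape("osobennyi"))  # ('osob', 'en', 'yi')
print(split_participle_shape("konnyi"))     # ('k', 'on', 'yi')
print(split_participle_shape("ruchnoy"))    # None
```

The adjective "konnyi" from case 1.1 also fits the pattern, so a shape test alone cannot separate participles from adjectives. Some extra signal, such as the presence of a prefix, would be needed.
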
>
>
>
>
> This may look too complicated; please email me if anything needs clarifying.
> Or please allow me some time to return to this topic and come up with a
> digestible algorithm. Actually, I was planning to test the Russian stemmer
> as part of my student research in December this year.
>
> Kind regards
>
> Svetlana
>
>
>
>
>
>
> -----Original Message-----
> From: snowball-discuss-admin@lists.tartarus.org
> [mailto:snowball-discuss-admin@lists.tartarus.org]On Behalf Of
> snowball-discuss-request@lists.tartarus.org
> Sent: Monday, September 09, 2002 5:45 PM
> To: snowball-discuss@lists.tartarus.org
> Subject: Snowball-discuss digest, Vol 1 #5 - 1 msg
>
> Today's Topics:
>
> 1. Re: russian stemmer (Martin Porter)
>
> --__--__--
>
> Message: 1
> To: Oleg Bartunov <oleg@sai.msu.su>
> From: martin_porter@softhome.net (Martin Porter)
> Cc: snowball-discuss@lists.tartarus.org
> Date: Sun, 08 Sep 2002 23:12:26 -0600
> Subject: [Snowball-discuss] Re: russian stemmer
>
>
> Oleg,
>
> I've had a look at -n-ogo, -n-yi etc endings through the Russian vocabulary,
> and feel that I would need to take linguistic advice before I could make any
> progress with -n- removal.
>
> As you may recall, I did the Russian stemmer with a linguist, Pat Miles, who
> lives some 60 miles away, and is not really a computer user. Also, Pat
> charges for his work, which is a further inconvenience to me! I'd rather try
> to get free linguistic help now through the open source community. Is there
> anyone you know in Russia who might experiment a bit further with the
> Snowball stemmer to see if they could make improvements here?
>
> Martin
>
> >The current Russian stemmer doesn't seem to handle adjective endings like
> >'nogo', 'nomu', 'nyi' ..., so
> >velosipednogo (bicycle) -> velosipedn~ogo
> >velosipednyi -> velosipedn~yi
> >while it would be better to have
> >velosipednogo -> velosiped~nogo
> >velosipednyi -> velosiped~nyi
> >
> >I'm not a linguist, so I don't know how to properly distinguish
> >'nogo' from 'ogo' etc. Probably there are some grammar rules.
>
>
>
>
> --__--__--
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss@lists.tartarus.org
> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>
>
> End of Snowball-discuss Digest
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83