[Snowball-discuss] RE: Snowball-discuss digest, Vol 1 #5 - 1 msg

Oleg Bartunov oleg@sai.msu.su
Mon Sep 9 10:46:01 2002


Svetlana, you're my hero! I'd like to test the Russian stemmer
(the current version is available at http://intra.astronet.ru/db/lingua/snowball/).
I'm working on the development of the Russian Scientific Network (www.nature.ru),
and we have found that stemming is very important for scientific corpora.

	Oleg
On Mon, 9 Sep 2002, Svetlana Pereyaslavets wrote:

> Dear Martin,
> I am not a linguist, but a native Russian speaker. May I try to give some
> explanation of this suffix?
> Free, but hopefully helpful :-)
> It is very common in Russian adjectives and adverbs that follow this
> construction:
>
> *****basic*construction********
>   prefix-root - "other optional" suffix - n (Oleg's question) - <adjective
> ending> (= yi/iy/oy... through all genders and declensions)
> *******************************
> The rule has the following options:
> 1.  prefix-root  - n - <adjective ending >
>
> 	1.1. The root itself ends in -n-
> 	In this case we will encounter -nn- after stripping the adjective ending,
> 	and we SHOULD REMOVE one -n- (that is the suffix).
> 	Such words usually don't have prefixes (so they can easily be checked
> 	against a dictionary).
> 	Example: kon-n-yi (adjective from "kon'" = horse)
>
> 	1.2. The root ends in any other letter
>
> 	We SHOULD REMOVE the -n- (that is the suffix).
> 	Example: ruch-n-oy (adjective from "ruka" = hand).
>
> 2.  prefix-root  - "other optional"suffix - n - adjective ending
>
>
> 	2.1. other optional suffix = - an- or - yan -
>  	- a- or -ya- SHOULD BE REMOVED TOGETHER with the suffix -n-.
>
> 	Example: "sherst-yan-oy" (=woolen).
>
> 	THREE exceptions from this rule would fall under case 2.2:
> 	"stekl-yan -n - <adjective ending>" (adj from glass)
> 	"olov-yan -n - <adjective ending>" (adj from tin)
> 	"derev-yan -n - <adjective ending>" (adj from wood)
>
> 	2.2.  other optional suffix = -on - or -en-
>
> 	REMOVE -n- and then the -en- or -on- next to it.
>
> 	Example: "osob - en- n- <adjective ending>" (=special)
>
> 	ONE exception from this rule would fall under case 2.1:
> 	"ran -en- <adjective ending>" (= injured)
>
> 	2.3. HARD CASE (RUSSIAN LEXICAL DIVERSITY IS INVOLVED) - I can't suggest a
> 	solution right now, as I need time to think about how to detect this without
> 	knowledge of the natural language:
>
> 	other optional suffix = -in-
>
> 		2.3.1. If the following substitution is valid, refer to 1.1 or 2.1
> 		(i.e. -n- SHOULD BE REMOVED; the -in- suffix MAY, and probably SHOULD,
> 		also be removed, depending on the level of detail required)
>
> 		             -a (noun)
> 		            /
> 		root - in -|
> 		            \
> 		             -n- <adjective ending>
>
>
> 		Example: "star-in-a"-"star-in-n -yi" (= old)
>
> 		2.3.2 If the substitution  above is not valid refer to 1.2. or 2.2. with
> the same reservation.
>
> 		Example: "mysh-in - <adjective ending> " (adjective from "mysh'"= mouse)
>
> 3. PARTICIPLE II may look the same as an adjective for an end-stripping
> stemmer.
> In Participles II, the scheme is :
>
>  word - {-on, -en, -an, -yan} - n - <adjective ending>
>
> Where "word" is VERY LIKELY to consist of "prefix-root" (i.e. there is a
> high probability that participle II would have a prefix).
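
A minimal Python sketch of rules 1 to 2.2 above, working on transliterated
words. The ending list, the N_ROOTS set (a stand-in for the dictionary check
mentioned in 1.1), the exception set and the function names are assumptions
made for illustration; none of this is part of the Snowball stemmer. The -in-
case (2.3) and participles (3) are deliberately left untouched.

```python
# Illustrative subset of transliterated adjective endings (assumption).
ADJ_ENDINGS = ("ogo", "omu", "yi", "iy", "oy")
# Hypothetical dictionary of roots that themselves end in -n (rule 1.1).
N_ROOTS = {"kon"}
# The single exception named under rule 2.2: -en- followed by one -n- only.
SINGLE_N_EXCEPTIONS = {"ranen"}


def strip_ending(word):
    """Remove the longest matching adjective ending, or return None."""
    for ending in sorted(ADJ_ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) > len(ending):
            return word[:-len(ending)]
    return None


def stem_adjective(word):
    base = strip_ending(word)
    if base is None:
        return word
    # Rule 1.1: root ends in -n, so -nn- loses exactly one n.
    if base.endswith("nn") and base[:-1] in N_ROOTS:
        return base[:-1]
    # Rule 2.3 (hard case): leave -in- / -in-n- alone in this sketch.
    if base.endswith(("in", "inn")):
        return base
    # Rule 2.2: -en-n- / -on-n-; -yan-n- is treated the same way, which
    # covers the three listed exceptions but is broader than the post says.
    for suffix in ("yann", "enn", "onn"):
        if base.endswith(suffix):
            return base[:-len(suffix)]
    # Exception to rule 2.2: "ran-en-" behaves like rule 2.1.
    if base in SINGLE_N_EXCEPTIONS:
        return base[:-2]
    # Rule 2.1: -an- / -yan- is removed together with the n.
    for suffix in ("yan", "an"):
        if base.endswith(suffix):
            return base[:-len(suffix)]
    # Rule 1.2: any other root, just drop the suffix -n-.
    if base.endswith("n"):
        return base[:-1]
    return base
```

Under these assumptions, stem_adjective("velosipednogo") and
stem_adjective("velosipednyi") both give "velosiped", while "konnyi" is only
reduced to "kon" because "kon" is listed in N_ROOTS.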
>
>
>
>
> This may look too complicated; please email me if you need anything clarified.
> Or please allow me some time to return to this topic and come up with a
> digestible algorithm. Actually, I was planning to test the Russian stemmer
> within the scope of my student research in December this year.
>
> Kind regards
>
> Svetlana
>
>
>
>
>
>
> -----Original Message-----
> From: snowball-discuss-admin@lists.tartarus.org
> [mailto:snowball-discuss-admin@lists.tartarus.org]On Behalf Of
> snowball-discuss-request@lists.tartarus.org
> Sent: Monday, September 09, 2002 5:45 PM
> To: snowball-discuss@lists.tartarus.org
> Subject: Snowball-discuss digest, Vol 1 #5 - 1 msg
>
>
> Today's Topics:
>
>    1. Re: russian stemmer (Martin Porter)
>
> --__--__--
>
> Message: 1
> To: Oleg Bartunov <oleg@sai.msu.su>
> From: martin_porter@softhome.net (Martin Porter)
> Cc: snowball-discuss@lists.tartarus.org
> Date: Sun, 08 Sep 2002 23:12:26 -0600
> Subject: [Snowball-discuss] Re: russian stemmer
>
>
> Oleg,
>
> I've had a look at -n-ogo, -n-yi etc endings through the Russian vocabulary,
> and feel that I would need to take linguistic advice before I could make any
> progress with -n- removal.
>
> As you may recall, I did the Russian stemmer with a linguist, Pat Miles, who
> lives some 60 miles away, and is not really a computer user. Also, Pat
> charges for his work, which is a further inconvenience to me! I'd rather try
> to get free linguistic help now through the open source community. Is there
> anyone you know in Russia who might experiment a bit further with the
> Snowball stemmer to see if they could make improvements here?
>
> Martin
>
> >The current Russian stemmer doesn't seem to treat adjective endings like
> >'nogo', 'nomu', 'nyi' ..., so
> >velosipednogo (bicycle) -> velosipedn~ogo
> >velosipednyi -> velosipedn~yi
> >while it would be better to have
> >velosipednogo -> velosiped~nogo
> >velosipednyi ->  velosiped~nyi
> >
> >I'm not a linguist, so I don't know how to properly distinguish
> >'nogo' from 'ogo' etc. Probably there are some grammar rules.
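
To make the "~" notation in the examples above concrete, here is a small
Python sketch that marks where the stem ends, first with the plain ending and
then with a preceding n moved into the ending. The ENDINGS tuple and the
mark_split name are hypothetical; the blanket "always take the n" behaviour is
only the simple variant asked about here, not a linguistically safe rule.

```python
# Endings from the examples in the post, without the leading 'n' (assumption).
ENDINGS = ("ogo", "omu", "yi")


def mark_split(word, strip_n=False):
    """Return word with '~' at the stem boundary.

    With strip_n=True, an 'n' directly before the ending is counted as
    part of the ending rather than part of the stem.
    """
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) > len(ending):
            cut = len(word) - len(ending)
            if strip_n and cut > 1 and word[cut - 1] == "n":
                cut -= 1
            return word[:cut] + "~" + word[cut:]
    return word  # no known ending: leave the word unmarked


print(mark_split("velosipednogo"))                # velosipedn~ogo
print(mark_split("velosipednogo", strip_n=True))  # velosiped~nogo
```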
>
>
>
>
> --__--__--
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss@lists.tartarus.org
> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>
>
> End of Snowball-discuss Digest
>

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83