[Snowball-discuss] Stop word lists
Oleg Bartunov
oleg@sai.msu.su
Fri Oct 18 20:09:02 2002
On Fri, 18 Oct 2002, Martin Porter wrote:
>
> Oleg,
>
>
> I have incorporated your corrections into the Russian stop word list. The
> reason they arose was that Pat Miles prepared the list in the transliterated
> form, and I never showed him the Cyrillic equivalent. It is of course easy
> to make mistakes if you're not used to the transliteration scheme.
>
> (I might say that I now regard the Library of Congress transliteration
> scheme as very unnatural. Even so, I can't think up anything better that
> guarantees two-way translation.)
Martin, could you try a virtual keyboard:
http://lingvo.yandex.ru/index_keyboard_qwerty.html
>
> I've looked at the list you sent me, and it seems to contain paradigm forms
> only - at least for some of the words. So kak is there, but not kakai^a,
> kakoi`, kakov, kakovo. My list also omits some of these. Actually, it is not
> easy for me to put together a more complete list. I am beginning to suffer
> by not being able to input Cyrillic at the keyboard, which is a great
> nuisance. So if you would like to take control of the Russian stop word list
> for the Snowball site you are more than welcome!
I now have a list of Russian words ranked by frequency. I got it from
a recent crawl of 10 million pages. Unfortunately, I'm very busy, but I'll
try to do something for the Snowball site.
>
> A few questions:
>
> Is KOI8-R fairly universal in Russia for representing Cyrillic? Other
> codings are mentioned in the browsers: ISO-8859-5, CP-866 etc - I've no idea
> what they mean. Are any of them ever used?
>
Almost all of them are in use! We have a special module for the Apache web
server to convert between encodings. KOI8-R is used mostly in Unix
environments and in mail, while CP-1251 is used in Windows, CP-866 in DOS, ...
You can read about KOI8-R at http://koi8.pp.ru/framed-koi8.html
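
As a rough illustration of what such a conversion involves (a minimal
sketch using Python's standard codecs, not the Apache module itself; the
helper name recode is made up for this example), the same Russian word has
different bytes in each encoding and can be recoded from one to another:

```python
# Python codec names for the encodings mentioned in this thread.
ENCODINGS = ["koi8_r", "cp1251", "cp866", "iso8859_5"]


def recode(data: bytes, src: str, dst: str = "utf-8") -> bytes:
    """Decode bytes from the src encoding and re-encode them as dst."""
    return data.decode(src).encode(dst)


word = "как"  # the stop word "kak" from the discussion

# The same three letters map to different byte values in each encoding.
encoded = {enc: word.encode(enc) for enc in ENCODINGS}
for enc, raw in encoded.items():
    print(f"{enc:10} {raw.hex()}")

# Recoding from any of them to UTF-8 recovers the same text.
for enc, raw in encoded.items():
    assert recode(raw, enc).decode("utf-8") == word
```

The source encoding has to be known in advance: these are all single-byte
encodings, so the bytes themselves do not say which one was used.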
> I've added a note in the stopword list that e" (e with two dots) is
> translated to e, as you advise. But is e" ever used outside dictionaries and
> grammars in Russian? I know what it means (-e- pronounced heavy as o, as in
> 'Gorbache"v'), but I thought it was always printed as 'e'.
In practice, e" is used in printed forms, and most search engines just
translate it to 'e'. I have even forgotten where e" is on my keyboard :)
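
A minimal sketch of that folding, assuming the text is already decoded to
Unicode (the names fold_yo and is_stop_word and the tiny word list are
hypothetical, for illustration only):

```python
import unicodedata

# Map the letter "yo" (e with two dots) to plain "e", in both cases.
_YO_TABLE = str.maketrans({"ё": "е", "Ё": "Е"})


def fold_yo(text: str) -> str:
    """Replace ё/Ё with е/Е, including the decomposed form (е + U+0308)."""
    # NFC first, so that "е" followed by a combining diaeresis becomes "ё".
    return unicodedata.normalize("NFC", text).translate(_YO_TABLE)


# Hypothetical mini stop word list; the real list is much longer.
STOP_WORDS = {fold_yo(w) for w in ["как", "ещё", "её", "все"]}


def is_stop_word(token: str) -> bool:
    """Check a token against the list after lowercasing and folding."""
    return fold_yo(token.lower()) in STOP_WORDS


print(fold_yo("Горбачёв"))
print(is_stop_word("ЕЩЁ"), is_stop_word("еще"))
```

Folding both the list and the incoming tokens means it does not matter
whether a given text was typed with e" or with plain e.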
>
> I know that some languages that use Cyrillic (but not of course Russian)
> have accented Cyrillic letters. Is there a standard way of encoding these in
> KOI8-R?
>
There are pages about this problem:
http://peoples.org.ru/eng_index.html
http://peoples.org.ru/eng_alfavit.html
The main conclusion is to use Unicode.
>
> Martin
>
>
>
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83