[Snowball-discuss] Stop word lists

Martin Porter martin_porter@softhome.net
Fri Oct 18 18:54:01 2002


Oleg,


I have incorporated your corrections into the Russian stop word list. The
reason they arose was that Pat Miles prepared the list in the transliterated
form, and I never showed him the Cyrillic equivalent. It is of course easy
to make mistakes if you're not used to the transliteration scheme.

(I might say that I now regard the Library of Congress transliteration
scheme as very unnatural. Even so, I can't think up anything better that
guarantees two-way translation.)

I've looked at the list you sent me, and it seems to contain paradigm forms
only - at least for some of the words. So kak is there, but not kakai^a,
kakoi`, kakov, kakovo. My list also omits some of these. Actually, it is not
easy for me to put together a more complete list. I am beginning to suffer
from not being able to input Cyrillic at the keyboard, which is a great
nuisance. So if you would like to take control of the Russian stop word list
for the Snowball site you are more than welcome!

A few questions:

Is KOI8-R fairly universal in Russia for representing Cyrillic? Other
codings are mentioned in the browsers: ISO-8859-5, CP-866 etc - I've no idea
what they mean. Are any of them ever used?
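
As a small illustration of why the choice of coding matters, here is a
minimal Python sketch that encodes one Cyrillic word under three legacy
8-bit codings and shows that the bytes differ in each. The codec names
(koi8_r, cp866, iso8859_5) are Python's standard ones and are assumed to
correspond to the codings the browsers list; the sample word is chosen
only for illustration.

```python
# Encode one Cyrillic word under three legacy 8-bit codings and compare bytes.
WORD = "как"  # sample word, chosen only for illustration

# Python's standard codec names; assumed to match the browser menu entries.
CODECS = ["koi8_r", "cp866", "iso8859_5"]


def encode_all(text):
    """Return a dict mapping each codec name to the encoded bytes of text."""
    return {name: text.encode(name) for name in CODECS}


encoded = encode_all(WORD)
for name, raw in encoded.items():
    # Print the bytes in hex so the three codings can be compared by eye.
    print(name, raw.hex())
```

Each coding round-trips the word correctly on its own, but bytes written
in one coding and read back in another come out as different letters, so
a stop word list has to state which coding it uses.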

I've added a note in the stopword list that e" (e with two dots) is
translated to e, as you advise. But is e" ever used outside dictionaries and
grammars in Russian? I know what it means (-e- pronounced heavy as o, as in
'Gorbache"v'), but I thought it was always printed as 'e'.
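
The folding of e with two dots to plain e can be sketched as below. This
is only an illustration in Python working on Cyrillic text directly, not
the Snowball implementation; the function names and the two-entry stop
word set are hypothetical.

```python
import unicodedata

# Map precomposed ё/Ё to е/Е (a sketch, not the Snowball code).
YO_FOLD = str.maketrans({"ё": "е", "Ё": "Е"})


def fold_yo(text):
    """Replace ё with е, after composing any 'е + combining diaeresis' pairs."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(YO_FOLD)


# Hypothetical stop word list, stored in the folded (plain е) form.
STOP_WORDS = {"еще", "все"}


def is_stop_word(word):
    """Check a word against the list after lowercasing and folding ё."""
    return fold_yo(word.lower()) in STOP_WORDS
```

Storing the list in the folded form and folding each incoming word means
a text matches whether or not it was printed with the two dots.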

I know that some languages that use Cyrillic (but not of course Russian)
have accented Cyrillic letters. Is there a standard way of encoding these in
KOI8-R?


Martin