[Xapian-discuss] Multilingual issues with Xapian

Thu Oct 11 12:17:23 BST 2007

On Thu, Oct 11, 2007 at 02:09:10AM +0200, Ron Kass wrote:
> During indexing, the stemmer was not set since the text is non-English.

This is the nub of your problem really - indexing and searching need to
be done in a compatible way.  If you didn't set a stemmer during
indexing, you shouldn't set one during searching.
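
To make the mismatch concrete, here's a minimal sketch in Python.  It
doesn't use Xapian at all: toy_stem, build_index and search are
hypothetical helpers invented for illustration, and the suffix-stripping
rule is only a crude stand-in for a real stemmer.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(docs, stem=None):
    """Map each (optionally stemmed) term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem(word) if stem else word].add(doc_id)
    return index

def search(index, query, stem=None):
    """Look up a single-word query, applying the given stemmer (if any)."""
    word = query.lower()
    return index.get(stem(word) if stem else word, set())

docs = {1: "indexing documents", 2: "search engines"}

unstemmed = build_index(docs)           # no stemmer at index time
stemmed = build_index(docs, toy_stem)   # stemmer at index time

# Same treatment on both sides: the term is found.
consistent_hits = search(unstemmed, "documents")            # {1}
both_stemmed_hits = search(stemmed, "document", toy_stem)   # {1}

# Stemmer only at search time: "documents" becomes "document",
# which was never indexed, so nothing matches.
mismatched_hits = search(unstemmed, "documents", toy_stem)  # set()
```

The point is only that the term generated from the query has to be the
same term that was generated from the document.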

> Naturally if I stem documents during indexing to the proper language, such 
> stemming should be done on words during searching. So that if a user is 
> searching for a Russian word, the Russian stemming should be applied to the 
> word, and not English. However while it is relatively easy/possible to 
> detect a language of a document, a single word's language is not so 
> simple/possible.

As you point out, it's not possible in general to detect the language of
a single word.  Indeed many words are valid in more than one language.

I think building a system with documents in multiple languages is
inherently a hard problem.  I've seen several different approaches tried
over the years, and they all have drawbacks.

> What if instead of stemming all the words in a document, even if they have 
> no real stemmed form, the stemmer (during indexing) was to stem only words 
> that it knows to have a stemmed form?

As James points out, the Snowball stemmers are algorithmic.  You always
get a stemmed form (well, unless the algorithm fails to terminate, but
they're carefully written to avoid that).  It's actually a nice feature
in many ways - for example, neologisms get handled without having to
update the algorithms.
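
As a rough illustration of the difference, here's a sketch in Python
contrasting a lookup-based stemmer with a rule-based one.  Both are
hypothetical toys (LEXICON, dictionary_stem and rule_stem are names I've
made up, and rule_stem is NOT the Snowball English algorithm); the only
aim is to show that a rule-based stemmer returns something for every
input, including words it has never seen.

```python
# Hypothetical dictionary-based stemmer: it only knows the words listed here.
LEXICON = {"running": "run", "searches": "search"}

def dictionary_stem(word):
    """Return the stem if the word is in the lexicon, else None."""
    return LEXICON.get(word)

def rule_stem(word):
    """Toy rule-based stemmer: strips a suffix if one matches, and
    otherwise returns the word unchanged.  It always returns a string."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A word coined after the lexicon was written.
neologism = "blogs"

lookup_result = dictionary_stem(neologism)  # None: the lexicon needs updating
rule_result = rule_stem(neologism)          # "blog": no update required
```

With the rule-based version, "blogs" and "blog" end up as the same term
without anybody having to maintain a word list.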

I can see you could look at the alphabet used in a word - a word in
the Cyrillic alphabet clearly isn't an English word.  But most alphabets
are used by more than one language, and a word in the Cyrillic alphabet
may not be Russian either...
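
For what it's worth, checking the alphabet is easy enough.  Here's a
sketch in Python using the standard unicodedata module; scripts_in is a
hypothetical helper, and it relies on the heuristic that the Unicode name
of a letter starts with the name of its script (e.g. "CYRILLIC SMALL
LETTER PE").  Note that it tells you the script, not the language.

```python
import unicodedata

def scripts_in(word):
    """Return the set of script names used by the letters in `word`.

    Heuristic: take the first word of each letter's Unicode character
    name, e.g. 'LATIN' from 'LATIN SMALL LETTER A'.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

english = scripts_in("hello")      # {"LATIN"}
russian = scripts_in("привет")     # {"CYRILLIC"}
ukrainian = scripts_in("привіт")   # {"CYRILLIC"}
```

The Russian and Ukrainian words both come back as Cyrillic, so this can
rule English out but can't tell you which stemmer to use instead.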

> Even without the multiple language stemming, if the stemmer doesn't try to 
> stem words it doesn't have a stemmed form for, it would solve the problem 
> as a word ???????? which it doesn't recognize, will be both indexed AND 
> parsed/searched in the unstemmed original form.

But the English stemmer shouldn't modify words in non-Latin alphabets
(if it does, I'd say that's a bug).  So if you just set the stemmer
consistently for indexing and searching, such words will still be indexed
and searched in their original form, exactly as they are now.

Cheers,
    Olly