[Xapian-discuss] Multilingual issues with Xapian
Olly Betts
olly at survex.com
Thu Oct 11 12:17:23 BST 2007
On Thu, Oct 11, 2007 at 02:09:10AM +0200, Ron Kass wrote:
> During indexing, the stemmer was not set since the text is non-English.
This is the nub of your problem really - indexing and searching need to
be done in a compatible way. If you didn't set a stemmer during
indexing, you shouldn't set one during searching.
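
To illustrate the mismatch, here is a minimal sketch in Python. It does not use Xapian at all: `toy_stem`, `identity`, `build_index` and `search` are hypothetical helpers made up for this example, and the suffix-stripper is only a crude stand-in for a real stemmer, not the Snowball algorithm.

```python
def toy_stem(word):
    """Hypothetical suffix-stripper standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def identity(word):
    """No stemming: terms are used exactly as they appear."""
    return word


def build_index(docs, normalise):
    """Map each normalised term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(normalise(word), set()).add(doc_id)
    return index


def search(index, query, normalise):
    """Look up a single-word query after applying the same kind of normalisation."""
    return index.get(normalise(query.lower()), set())


docs = {1: "indexing documents", 2: "search engines"}

# Index built WITHOUT stemming, as in the situation described above.
unstemmed_index = build_index(docs, identity)

# Stemming only at search time: the query becomes "document",
# but the index only contains "documents", so nothing matches.
mismatch = search(unstemmed_index, "documents", toy_stem)

# No stemming on either side: the term is found.
match = search(unstemmed_index, "documents", identity)

# Stemming on both sides is equally consistent, and also works.
stemmed_index = build_index(docs, toy_stem)
both_stemmed = search(stemmed_index, "documents", toy_stem)
```

The point of the sketch is only that the same normalisation has to be applied on both sides; which normalisation you choose is a separate question.
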
> Naturally if I stem documents during indexing to the proper language, such
> stemming should be done on words during searching. So that if a user is
> searching for a Russian word, Russian stemming should be applied to the
> word, and not English. However, while it is relatively easy/possible to
> detect the language of a document, detecting a single word's language is
> not so simple/possible.
As you point out, it's not possible in general to detect the language of
a single word. Indeed many words are valid in more than one language.
I think trying to build a system with documents in multiple languages is
actually inherently a hard problem. I've seen various approaches tried
over the years, and they all have drawbacks.
> What if instead of stemming all the words in a document, even if they have
> no real stemmed form, the stemmer (during indexing) was to stem only words
> that it knows to have a stemmed form?
As James points out, the Snowball stemmers are algorithmic. You always
get a stemmed form (well, unless the algorithm fails to terminate, but
they're carefully written to avoid that). It's actually a nice feature
in many ways - for example, neologisms get handled without having to
update the algorithms.
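
A rough sketch of the difference, in Python. Neither function is a Snowball stemmer: `LEXICON`, `dictionary_stem` and `rule_stem` are hypothetical names invented here, and the rules are deliberately simplistic. It only shows why a rule-based approach always yields some stem, while a lookup-based one has nothing to say about a word it has never seen.

```python
# Hypothetical lexicon for a dictionary-based stemmer (illustrative entries only).
LEXICON = {"running": "run", "searches": "search"}


def dictionary_stem(word):
    """Return the stem if the word is in the lexicon, otherwise None."""
    return LEXICON.get(word)


def rule_stem(word):
    """Toy rule-based stemmer: strip the first matching suffix.

    It always returns a string, even for words nobody has listed anywhere.
    """
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# A neologism that is absent from the lexicon.
new_word = "googling"
from_lexicon = dictionary_stem(new_word)  # None: the lexicon needs updating
from_rules = rule_stem(new_word)          # "googl": handled without any update
```
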
I can see you could look at the alphabet used in a word - a word in
the Cyrillic alphabet clearly isn't an English word. But most alphabets
are used by more than one language, and a word in the Cyrillic alphabet
may not be Russian either...
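
As a minimal sketch of what looking at the alphabet could mean in practice, the Python below uses the standard `unicodedata` module. The helper `scripts_in` is a hypothetical name, and taking the first word of each character's Unicode name as its script is a heuristic assumed for this example, not an official script property lookup.

```python
import unicodedata


def scripts_in(word):
    """Return the set of script names seen in the letters of `word`.

    Heuristic: the first word of a letter's Unicode name, for example
    "CYRILLIC" from "CYRILLIC SMALL LETTER PE".
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


russian = scripts_in("привет")    # a Russian word
ukrainian = scripts_in("привіт")  # a Ukrainian word
english = scripts_in("hello")

# Both Cyrillic words report the same script, so the script alone
# cannot say which language's stemmer would be the right one.
same_script = russian == ukrainian
```

This rules English out for a Cyrillic word, but it does not pick between Russian, Ukrainian, or any other language written in Cyrillic.
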
> Even without the multiple language stemming, if the stemmer doesn't try to
> stem words it doesn't have a stemmed form for, it would solve the problem
> as a word ???????? which it doesn't recognize, will be both indexed AND
> parsed/searched in the unstemmed original form.
But the English stemmer shouldn't modify words in non-Latin alphabets
(if it does, I'd say that's a bug). So if you just set the stemmer
consistently for indexing and searching, everything will work as it is
now.
Cheers,
Olly