[Xapian-discuss] UTF8 support plans (without stemming)

Thu Apr 28 13:37:02 BST 2005

On Thu, Apr 28, 2005 at 11:08:28AM +0400, Alexandre wrote:
> Anyway, usually, when application/library was developed to support only 
> one language (american/english)

If you're going to point fingers, point them in the right direction -
none of the major contributors to Xapian are from the USA...

> it's very hard to make it work with other languages (for example, with
> russian) - there are lots of problems inside...

The problems aren't anywhere near as great as you seem to expect, at
least in part because Unicode support has always been a goal we've kept
in mind.

As the message you gave the URL for says, there are only two parts of the 
library which make any sort of character set assumptions - the stemmers
and the query parser (everything else just handles things opaquely).

Both of these currently assume latin1, which covers a lot of languages
other than English, although not as many as we'd like.  (Actually this
isn't quite true of the stemmers - the Russian stemmer assumes KOI8-R.)
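
To make the character set issue concrete, here is a minimal Python sketch
(not Xapian code; the helper name encodings_for is made up for
illustration).  It shows that the same character has different byte
representations in latin1, KOI8-R and UTF-8, and that some characters
cannot be represented at all in a given 8-bit charset, which is why code
that inspects bytes has to know the encoding it was given.

```python
# The same character maps to different bytes in each charset, so code that
# looks at raw bytes must know which encoding it was handed.
# encodings_for is a hypothetical helper, not part of any library.

def encodings_for(ch):
    """Return {charset: bytes} for every charset that can represent ch."""
    out = {}
    for charset in ("latin-1", "koi8_r", "utf-8"):
        try:
            out[charset] = ch.encode(charset)
        except UnicodeEncodeError:
            pass  # this charset has no code for the character
    return out

samples = {"e-acute": "\u00e9", "cyrillic ya": "\u044f"}

for name, ch in samples.items():
    print(name, encodings_for(ch))
```

In this sketch the accented Latin letter has no KOI8-R encoding and the
Cyrillic letter has no latin1 encoding, while UTF-8 can represent both
(using two bytes each).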

In order to remove this assumption, we really need to have versions of
the stemmers which understand UTF-8.  That requires some changes to the
Snowball code generator (which I've been hoping someone else will be
motivated enough to do before I get around to it).

The query parser also needs some modifications as it currently performs
some normalisation of accents.  This is pretty trivial to fix - I
actually already have a hacked version I'm using for gmane which I'll
commit once it's working well and I've cleaned it up.
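
As an illustration of what accent normalisation looks like once the input
is UTF-8 rather than latin1, here is a minimal Python sketch.  It is an
assumption about one possible approach (Unicode decomposition followed by
dropping combining marks), not the query parser's actual code; the names
strip_accents and normalise_term are invented for this example.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalise_term(raw_bytes):
    """Decode a UTF-8 query term, remove accents and lowercase it.

    Hypothetical helper: a real parser would also need to decide what to
    do with input that is not valid UTF-8.
    """
    text = raw_bytes.decode("utf-8")
    return strip_accents(text).lower()

print(normalise_term("Caf\u00e9".encode("utf-8")))
```

One thing this sketch makes visible: blind accent stripping is not always
what you want.  For example, Cyrillic U+0439 decomposes to U+0438 plus a
combining breve, so this function turns one Russian letter into a
different one.  A real implementation would probably want to limit the
normalisation to the scripts where it makes sense.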

If you're interested in making this happen more quickly, help is
certainly appreciated.

> I just suppose, that computer can work well with lots of data, while 
> human brain can make some sort of decisions. No, I'm not for boolean 
> search, but I just didn't like probabilistic approach too much (when 
> machine tries to be smart)... I can (and probably is) absolutely wrong, 
> that's why I interested why people choose such approach.

If you don't like probabilistic retrieval and you don't favour boolean
search, what do you think is the right approach?

Since the user can't look at more than a handful of results at once, you
really need to return ranked results, so you need some ranking method.

At one level, the probabilistic model is just such a ranking method
(although it also offers relevance feedback, which gives you somewhere to
go if you aren't happy with the initial search results).
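
To show what "a ranking method" means in practice, here is a minimal
Python sketch of a BM25-style weighting, which is one well-known scheme
from the probabilistic family.  This is an illustration under stated
assumptions: the function names, the parameter defaults (k1=1.2, b=0.75)
and the particular idf variant are choices made for this example, and it
is not claimed to match the formula Xapian uses.

```python
import math

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return {doc_id: score}; docs maps doc_id to a list of terms."""
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs
    scores = {}
    for doc_id, terms in docs.items():
        score = 0.0
        for term in query_terms:
            tf = terms.count(term)  # within-document frequency
            if tf == 0:
                continue
            # Number of documents containing the term.
            df = sum(1 for t in docs.values() if term in t)
            # Rare terms get a higher weight than common ones.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Longer-than-average documents are penalised via b.
            norm = k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores[doc_id] = score
    return scores

def rank(query_terms, docs):
    """Return doc ids ordered from highest to lowest score."""
    scores = bm25_scores(query_terms, docs)
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "a": ["unicode", "stemmer", "unicode"],
    "b": ["query", "parser"],
    "c": ["unicode", "query", "parser", "notes"],
}
print(rank(["unicode"], docs))
```

The point is only that each document gets a numeric score from term
statistics and the results are sorted by it; the user still decides which
of the top few results are actually useful.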

Any ranking technique is going to make mistakes - even a human librarian
sitting in a room with a teletype isn't going to return the results the
user wants every time.

But you can measure how effective a ranking technique is by measuring
precision and recall for a number of test queries.  Such academic
studies over the years have shown the probabilistic model to be very
effective.  TREC is a good place to start if you want to look at the
results of such tests:  http://trec.nist.gov/
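
For reference, precision and recall for a single test query can be
computed as in the following minimal Python sketch.  The function name and
the document ids are made up for illustration; real evaluations such as
TREC average these figures over many queries and use human relevance
judgements.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents returned, 3 judged relevant overall,
# 2 of the relevant ones were among those returned.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p, r)
```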

Divergence from Randomness (DFR) is another promising model, which has
been developed more recently.  I've actually been looking at
implementing it in Xapian.  It's fairly well suited, except that the
term weights don't seem to be bounded, but that really just means one of
the Xapian matcher's optimisations can't be used.

Cheers,
    Olly