[Xapian-discuss] queryparser thinks

Olly Betts olly at survex.com
Tue Sep 13 14:35:45 BST 2005

On Tue, Sep 13, 2005 at 11:31:19AM +0200, Ralf Mattes wrote:
> On Tue, Sep 13, 2005 at 05:08:08AM +0100, Olly Betts wrote:
> > It's more germanocentric if anything.  
> Well, but in German 'accents' (umlauts et al.) _do_ carry meaning.

Yes, but there's a standard way to write a word if you can't (or don't
know how to) write the accents.  There are also regional variations.
For example, I'm told that ß is rarely used in Swiss German
speaking areas - instead they write "ss".  And the orthography change a
few years back means that some words which were formerly written with
ß now often aren't in Germany itself (I believe "muss" instead of
"muß" is a common example).

This is also the case when writing in capital letters as there's no
capitalised form of ß (so you see "EINBAHNSTRASSE" for "one way
street").
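
As an aside, the ß/ss equivalence described above is also built into
Unicode's default case mappings, so it is easy to see with a few lines of
Python.  This is only an illustrative sketch using the standard library;
it is not what Xapian or Omega does internally, and `same_word` is a
made-up helper name.

```python
# Illustrative only: standard-library Unicode case mapping, not Xapian code.

def same_word(a: str, b: str) -> bool:
    """Hypothetical helper: compare two spellings ignoring case and ß vs ss."""
    # str.casefold() applies full Unicode case folding, which maps ß to "ss".
    return a.casefold() == b.casefold()

# Python's upper() expands ß to "SS", matching the convention used on signs.
upper_form = "Einbahnstraße".upper()

print(upper_form)                # EINBAHNSTRASSE
print(same_word("muß", "muss"))  # True
```

Note that case folding alone does nothing for umlauts, so "Höhle" and
"Hoehle" would still compare as different words here.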

> > The transliteration should also really be language dependent - in German
> > ä -> ae, 
> That's a typographic convention used in circumstances where Umlaut
> glyphs aren't available (1970 TELETYPE ....).

It's true that this is probably less useful than it was back with
Muscat 3.6 (about 10 years ago).  But it seems a lot of people still
use their teletypes:

http://www.google.co.uk/search?q=H%C3%B6hle gives 2940000 hits

http://www.google.co.uk/search?q=Hoehle gives 267000 hits

That's about 9%.
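
For anyone who wants to check the figure, the ratio of the two hit counts
quoted above works out as follows (the variable names are just labels for
this sketch):

```python
# Hit counts exactly as quoted above.
hits_umlaut = 2_940_000   # search for "Höhle"
hits_translit = 267_000   # search for "Hoehle"

ratio = hits_translit / hits_umlaut
print(f"{ratio:.1%}")  # 9.1%
```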

Interestingly, if you look at the results for the first it seems Google
simply drops the umlaut when matching so that 2940000 includes a number
of hits for "Hohle".  That's worse than what we do, and also means that
the 2940000 is probably a slight overestimate.
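
The difference between the two approaches can be sketched roughly like
this.  It's a minimal illustration, not Xapian's actual code: the mapping
table and both function names are assumptions made up for the example,
and the "strip" variant only mimics the behaviour inferred from the
search results above.

```python
import unicodedata

# Assumed German fallback spellings; other languages would need other tables.
GERMAN_FALLBACK = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

def transliterate_german(word: str) -> str:
    """Replace umlauts and ß with their conventional ASCII spellings."""
    return word.translate(GERMAN_FALLBACK)

def strip_accents(word: str) -> str:
    """Drop accents entirely: decompose, then discard combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(transliterate_german("Höhle"))  # Hoehle
print(strip_accents("Höhle"))         # Hohle
```

With transliteration, "Höhle" and "Hoehle" end up as the same term while
"Hohle" stays distinct; with plain stripping, "Höhle" collides with
"Hohle" instead.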

> Presenting such a conversion to today's (web) users gives a rather
> archaic touch to the website.

But this all happens behind the scenes (at least as much as possible).
The stemmed form is what actually gets searched for, but when listing
which terms match which documents, Omega maps this back to the forms
the user actually entered.  Even in $topterms we try hard to avoid
presenting the stemmed form (though we don't always manage it).
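
The general idea of mapping stemmed terms back to what the user typed can
be sketched as below.  This is not Omega's implementation: `toy_stem` is a
hypothetical stand-in for a real stemmer (it just strips a few English
suffixes), and `build_unstem_map` is a name invented for this example.

```python
from collections import defaultdict

def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: lowercase, strip a suffix."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_unstem_map(query_words):
    """Record which user-entered forms produced each stemmed term."""
    unstem = defaultdict(list)
    for word in query_words:
        forms = unstem[toy_stem(word)]
        if word not in forms:
            forms.append(word)
    return unstem

unstem = build_unstem_map(["searches", "searching", "index"])

# The stemmed term is what gets matched, but the user's own forms are shown.
matched_terms = ["search"]
for term in matched_terms:
    print(term, "->", unstem[term])  # search -> ['searches', 'searching']
```

Terms which don't come from the query (as can happen with $topterms) have
no entry in such a map, which is one way the stemmed form can still leak
through.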

> What was the reason for not using the latest snowball version in Xapian?

As James said, there have been some minor tweaks to the algorithms since
the last time we imported a version.  Changing the algorithms makes
existing databases incompatible so we avoid doing it too often, and
try to do it in step with other database incompatible changes (and
not at a point release).

