[Xapian-discuss] queryparser thinks

Ralf Mattes rm at fabula.de
Tue Sep 13 10:31:19 BST 2005


On Tue, Sep 13, 2005 at 05:08:08AM +0100, Olly Betts wrote:
> On Sun, Aug 28, 2005 at 02:14:15PM +0200, R. Mattes wrote:
> > Yes, the queryparser itself modifies characters. The code that does this
> > is in 'xapian/xapian-core/queryparser/accentnormalisingitor.h'. IMHO
> > this is a rather "murky" and anglocentric part of the Xapian codebase.
> 
> It is perhaps murky, but not really anglocentric - very few English
> words use diacritical marks, and the remaining few seem to be
> disappearing.

Ok, what I meant by 'anglocentric' was: in English this kind of accent
modification doesn't really harm the language.

> It's more germanocentric if anything.  

Well, but in German 'accents' (umlauts et al.) _do_ carry meaning.
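
To make the point concrete, here is a minimal sketch of the kind of
accent stripping under discussion. It is not Xapian's actual code; the
helper name strip_accents is hypothetical, and the approach (Unicode
NFD decomposition, then dropping combining marks) is only an assumed
stand-in for what the query parser does.

```python
import unicodedata


def strip_accents(term: str) -> str:
    """Remove diacritics from a term (illustrative only, not Xapian code)."""
    # NFD splits e.g. "ö" into "o" plus a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFD", term)
    # Drop every combining mark and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# German pairs that differ only by an umlaut but mean different things:
# "schon" (already) vs "schön" (beautiful), "Mutter" (mother) vs "Mütter" (mothers).
for plain, accented in [("schon", "schön"), ("Mutter", "Mütter")]:
    conflated = strip_accents(plain) == strip_accents(accented)
    print(plain, accented, conflated)
```

After stripping, both words of each pair map to the same term, so a
search for one also matches the other.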

> This accent normalisation arises
> out of what we usually used to do with Muscat 3.6.  Back then the
> stemming algorithms had some quirky scheme of their own for representing
> accents (it involved '^'), but we eschewed this in favour of simply
> normalising accents before stemming.  This was easier than trying to
> translate them into '^'-form, and had the additional benefit that
> searches with the accents transliterated would match documents where
> they weren't and vice versa.
> 
> The main downside is occasional conflation of terms which shouldn't
> be (not just in Norwegian - for example the French for "peach" and
> "fish" differ only by accents, and I suspect examples can be found
> in other languages).
>
> The transliteration should also really be language dependent - in German
> ä -> ae, 

That's a typographic convention used in circumstances where umlaut
glyphs aren't available (1970 TELETYPE ...). Presenting such a
conversion to today's (web) users gives a rather archaic touch to
the website.

> but that's not appropriate in Swedish I believe.  But
> language dependent normalisation is what the stemming algorithms do!  So
> I think this really should get folded into the stemming algorithms in
> languages where it makes sense (and languages where it doesn't wouldn't
> do anything).

Yes, exactly. That's why I was so eager to get the Unicode support. BTW,
we currently use the UTF-8-aware stemmers from the Snowball distribution.
What was the reason for not using the latest Snowball version in Xapian?
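
As a rough illustration of language-dependent normalisation, here is a
sketch under stated assumptions. The function normalise, the table
GERMAN_MAP and the language codes are all hypothetical, and none of
this is Xapian or Snowball API. The German branch uses the ä -> ae
style of transliteration mentioned above; leaving Swedish terms
untouched is an assumption, since the thread only says that ä -> ae is
not appropriate there.

```python
import unicodedata

# Hypothetical table for the German transliteration convention.
GERMAN_MAP = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def strip_accents(term: str) -> str:
    """Language-independent fallback: drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise(term: str, lang: str) -> str:
    """Normalise a term before stemming, depending on the language."""
    # Use composed (NFC) lower-case characters so the table lookups work.
    term = unicodedata.normalize("NFC", term.lower())
    if lang == "de":
        # German: transliterate umlauts instead of stripping them.
        return "".join(GERMAN_MAP.get(ch, ch) for ch in term)
    if lang == "sv":
        # Assumption: Swedish letters are kept as they are.
        return term
    # Every other language: plain accent stripping.
    return strip_accents(term)


print(normalise("Müller", "de"))
print(normalise("Müller", "sv"))
print(normalise("pêche", "fr"))
```

The point of the dispatch on lang is that the choice between
stripping, transliterating and leaving alone sits next to the stemmer
for that language rather than in the query parser.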

 Cheers 

> Cheers,
>     Olly
> 



