[Xapian-discuss] Stemming behavior

John Leach john at johnleach.co.uk
Fri Aug 21 18:42:21 BST 2009


On Fri, 2009-08-21 at 17:22 +0200, dimazest at gmail.com wrote:
> I use python xapian bindings to stem strings and get this behavior:
> 
> Python 2.4.6 (#1, Jul 24 2009, 19:28:46)
> [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import xapian
> >>> xapian.version_string()
> '1.0.14'
> >>> s = xapian.Stem('en')
> >>> s('editing')
> 'edit'
> >>> s('Editing')
> 'Edite'
> 
> Is it a bug or a feature, that for the word 'Editing' different result
> is returned than for edit?

Hi Dima,

I think the stemmer is ignoring uppercase token prefixes, so in the
second case it's actually stemming the word "diting".  This is likely
related to Xapian's term prefixes, which are all uppercase:

http://xapian.org/docs/omega/termprefixes.html
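
If the uppercase initial is what triggers this, lower-casing the word
before stemming should sidestep it.  A minimal sketch, assuming that
guess is right: stem_lowercased is a hypothetical helper name (not part
of Xapian), and stem is any callable stemmer, such as the
xapian.Stem('en') object from your session.

```python
def stem_lowercased(stem, word):
    """Lower-case `word` before handing it to the stemmer callable `stem`.

    `stem` is assumed to be any callable that maps a string to its stem,
    for example the xapian.Stem('en') object from the quoted session.
    """
    return stem(word.lower())
```

With the stemmer from your session, stem_lowercased(s, 'Editing') passes
'editing' to s, which your own output shows stems to 'edit'.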

The stemming algorithm treats short English stems of the form
consonant-vowel-consonant differently, restoring a final "e" to handle
words like duping -> dupe, doting -> dote, etc.

Actually, it's more complicated than that:

http://snowball.tartarus.org/algorithms/english/stemmer.html
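
To make the "short word" idea concrete, here is a rough sketch of that
one piece of the English stemmer as I read the page above.  It is not
Xapian's code: the function names are my own, and it leaves out the
special cases (the at/bl/iz and double-consonant endings, the handling
of "y", and the R1 exceptions), so treat it as an illustration only.

```python
VOWELS = set("aeiouy")  # lowercase only; comparisons are case-sensitive


def r1(word):
    """Text after the first non-vowel that follows a vowel ('' if none)."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return word[i + 1:]
    return ""


def ends_in_short_syllable(word):
    """True for a final non-vowel/vowel/non-vowel (last not w, x or Y),
    or for a two-letter word made of a vowel plus a non-vowel."""
    if len(word) == 2:
        return word[0] in VOWELS and word[1] not in VOWELS
    if len(word) >= 3:
        c1, v, c2 = word[-3], word[-2], word[-1]
        return (c1 not in VOWELS and v in VOWELS
                and c2 not in VOWELS and c2 not in "wxY")
    return False


def is_short(word):
    """A word is short if it ends in a short syllable and R1 is empty."""
    return ends_in_short_syllable(word) and r1(word) == ""


def strip_ing(word):
    """Simplified fragment of step 1b: drop 'ing' when the remainder
    contains a vowel, then add 'e' if that remainder is a short word."""
    if word.endswith("ing") and any(ch in VOWELS for ch in word[:-3]):
        stem = word[:-3]
        return stem + "e" if is_short(stem) else stem
    return word


for w in ("duping", "doting", "diting", "editing"):
    print(w, "->", strip_ing(w))
```

In this sketch "edit" is not short, because its R1 is "it", so it gets
no extra "e", while "dit" is short and becomes "dite".  Note that the
sketch also turns "Editing" into "Edite" simply because an uppercase "E"
is not in its vowel set; I have not checked whether that is what
actually happens inside Xapian.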

John.

-- 
http://johnleach.co.uk



