[Xapian-discuss] Stemming behavior
john at johnleach.co.uk
Fri Aug 21 18:42:21 BST 2009
On Fri, 2009-08-21 at 17:22 +0200, dimazest at gmail.com wrote:
> I use python xapian bindings to stem strings and get this behavior:
> Python 2.4.6 (#1, Jul 24 2009, 19:28:46)
> [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import xapian
> >>> xapian.version_string()
> >>> s = xapian.Stem('en')
> >>> s('editing')
> >>> s('Editing')
> Is it a bug or a feature that for the word 'Editing' a different result
> is returned than for edit?
I think the stemmer is ignoring uppercase token prefixes. So in the
second case it's actually stemming the word "diting". This is likely
related to Xapian's term prefixes, which are all uppercase:
The stemming algorithm treats English words starting with
consonant-vowel-consonant differently, to handle words like duping ->
dupe, doting -> dote etc.
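
As a rough illustration of that rule, here is a toy sketch. It is not the
Snowball algorithm that Xapian actually uses; the helper name strip_ing and
the lowercase-only vowel set are assumptions made up for this example.

```python
VOWELS = set("aeiouy")  # assumption: lowercase only, 'y' counted as a vowel


def strip_ing(word):
    """Toy '-ing' stripper; a hypothetical helper, not part of Xapian."""
    if not word.endswith("ing"):
        return word
    stem = word[:-3]
    # Leave words like "sing" alone: the stem would contain no vowel.
    if not any(ch in VOWELS for ch in stem):
        return word
    # Stem is exactly consonant-vowel-consonant (final consonant not w or x):
    # put back the silent 'e', as in duping -> dupe.
    is_cvc = (
        len(stem) == 3
        and stem[0] not in VOWELS
        and stem[1] in VOWELS
        and stem[2] not in VOWELS
        and stem[2] not in "wx"
    )
    return stem + "e" if is_cvc else stem


for w in ("duping", "doting", "editing", "diting"):
    print(w, "->", strip_ing(w))
```

Under this toy rule "editing" gives "edit" while "diting" gives "dite", which
would be consistent with the guess above that the leading capital is skipped.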
Actually, it's more complicated than that: