[Xapian-discuss] Getting spelling to work

Olly Betts olly at survex.com
Tue Jan 8 22:45:05 GMT 2008


On Tue, Jan 08, 2008 at 09:15:05PM +0000, James Aylett wrote:
> On Tue, Jan 08, 2008 at 03:55:07PM -0500, Deron Meranda wrote:
> > This seems to imply that the term/postings are used as the
> > basis for spelling, but in reality it looks like the spelling "index"
> > is actually quite separate from the term/posting index.
> > Is that true?  And why?
> 
> Yes, it's separate; you might not want it to be automatically filled
> with every word generated from your corpus (for instance, if your
> corpus has lots of spelling mistakes in it).

Another reason - you may only be indexing stemmed forms, but the
spelling corrections want to be unstemmed.

> > So assume I want the spelling dictionary to be based upon all the
> > terms in the documents (and not some predefined dictionary).
> 
> That will depend on your application, but that's a reasonable approach
> to take.

Sometimes you might want to use a predefined dictionary though, or you
may want to seed the spelling dictionary with a predefined dictionary.

Decoupling the spelling from the terms offers more flexibility.
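
As a rough illustration of seeding, here is a minimal Python sketch.  It
assumes a hypothetical word-list format of one "word [frequency]" entry
per line, and that `db` is a writable database object providing
add_spelling(word, freqinc); the helper name seed_spellings is made up
for this example.

```python
def seed_spellings(db, lines):
    """Add each entry of a predefined word list to the spelling dictionary.

    `lines` is any iterable of strings in the (hypothetical) format
    "word [frequency]".  Blank lines and lines starting with "#" are
    skipped.  Returns the number of entries added.
    """
    added = 0
    for line in lines:
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue
        word = parts[0]
        # A missing frequency column counts as a single occurrence.
        freq = int(parts[1]) if len(parts) > 1 else 1
        db.add_spelling(word, freq)
        added += 1
    return added
```

You could run this once over a dictionary file before indexing, and then
carry on adding spellings from the documents themselves.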

> > How does the spelling word "frequency" affect things?  I would
> > assume that if there are multiple spelling suggestions, that the
> > one with the highest frequency would be returned (as the most
> > likely spelling).  This is sort of implied but not actually stated
> > anyplace I can find.
> 
> Pass. Richard?

Well, I'm not Richard, but I did write that code.  The frequency is
only used to decide between spelling corrections with the same edit
distance (considering insert character, remove character, change
character, and transpose two adjacent characters as each being an
"edit").  This is documented here:

http://www.xapian.org/docs/spelling.html
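
To make the rule concrete, here is a small pure-Python sketch.  It is
not the code Xapian uses, just an illustration under some assumptions:
the dictionary is a plain dict of word -> frequency, and max_distance=2
is a threshold chosen for the example.

```python
def edit_distance(a, b):
    """Edit distance where insert, remove, change, and transposing two
    adjacent characters each count as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # remove a character
                          d[i][j - 1] + 1,         # insert a character
                          d[i - 1][j - 1] + cost)  # change a character
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # transpose two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def suggest(word, dictionary, max_distance=2):
    """Return the closest dictionary word, or None if nothing is within
    max_distance.  Frequency only breaks ties between candidates at the
    same edit distance."""
    best_key, best_word = None, None
    for candidate, freq in dictionary.items():
        dist = edit_distance(word, candidate)
        if dist > max_distance:
            continue
        # Smaller distance wins; then higher frequency; then
        # alphabetical order, purely to keep the result deterministic.
        key = (dist, -freq, candidate)
        if best_key is None or key < best_key:
            best_key, best_word = key, candidate
    return best_word
```

So a rare word one edit away beats a very common word two edits away,
and the frequency only matters between candidates at equal distance.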

> > Then, most importantly, how does one then populate the spelling
> > dictionary when indexing documents?  Since every time you do
> > add_spelling() the frequency is incremented; what happens if I
> > want to re-index some document (or remove a document)?  For
> > the terms and postings, this is a valid thing to do.  Re-indexing
> > a document as many times as you want doesn't change things.
> > But if you're also adding its terms to the spellings, then re-indexing
> > can seriously skew the frequencies it would seem.
> 
> Umm, no idea. Richard?

I'm still not Richard, but in general, I'd suggest that you don't worry
about it - spellings from documents which used to be in the corpus, or
from older versions of documents, are still interesting.  You could think
of this as using the history of a collection as a source of spelling data.

If this really bothers you, you can run through the terms of the
document you're removing (if you indexed terms unstemmed) and remove
the corresponding entries from the spelling dictionary.
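
That clean-up could look something like the following Python sketch.
It makes several assumptions: `db` is a writable database object with
get_document(), remove_spelling(word, freqdec) and delete_document();
each entry of the document's termlist has .term and .wdf attributes;
unstemmed terms are the ones without an upper-case prefix; and the
spelling frequency was incremented once per occurrence at index time, so
the wdf is used as the decrement.  The function name is made up.

```python
def delete_document_and_spellings(db, docid):
    """Decrement spelling frequencies for a document's unprefixed terms,
    then delete the document.  Returns the list of words adjusted."""
    doc = db.get_document(docid)
    adjusted = []
    for item in doc.termlist():
        term = item.term
        # Some bindings hand terms back as bytes rather than str.
        if isinstance(term, bytes):
            term = term.decode("utf-8")
        # Assumption: terms starting with an upper-case ASCII letter
        # carry a prefix (stemmed forms, field terms) and were never
        # added to the spelling dictionary.
        if not term or "A" <= term[0] <= "Z":
            continue
        # Assumption: one add_spelling() call per occurrence at index
        # time.  Use 1 instead of item.wdf if you added once per document.
        db.remove_spelling(term, item.wdf)
        adjusted.append(term)
    db.delete_document(docid)
    return adjusted
```

For a re-index you would do the same for the old version of the document
before adding the new one.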

Cheers,
    Olly


