[Xapian-discuss] PHP Fatal error while indexing Wikipedia

Robert Young bubblenut at gmail.com
Wed Jan 2 23:53:58 GMT 2008


Thanks, lots of really interesting information.

On Jan 2, 2008 8:15 PM, James Aylett <james-xapian at tartarus.org> wrote:
> Post list is the index that goes from a term to the list of documents
> that term indexes. Position list is the list of positions within a
> given document that a term appears at.
>
> Disabling positions in your index will remove the need for the
> position list. You can't avoid the post list, as it's the main thing
> you need :-)
>
> If you have a large number of unique terms being generated, you'll get
> a large database. There may be something to do with your term
> generation that's unexpected here - you can dump a list of terms with
> a little PHP script to find out what's going on, perhaps. (Maybe run a
> couple of documents only in and see if you're getting an expected list
> of terms.)
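
To make the post list / position list distinction concrete, here is a toy in-memory sketch in Python. It is only an illustration and says nothing about how Xapian actually stores data on disk; the helper name `build_index` and the whitespace tokenisation are assumptions made up for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Build toy post lists and position lists from {doc_id: text}.

    Hypothetical helper for illustration only; Xapian's real storage
    format is different.
    """
    postlist = defaultdict(list)   # term -> list of doc ids containing it
    positions = defaultdict(dict)  # doc id -> {term: [positions in that doc]}
    for doc_id in sorted(docs):
        # Assumption: lowercase and split on whitespace, positions start at 1.
        for pos, term in enumerate(docs[doc_id].lower().split(), start=1):
            if doc_id not in postlist[term]:
                postlist[term].append(doc_id)
            positions[doc_id].setdefault(term, []).append(pos)
    return dict(postlist), dict(positions)


docs = {1: "the cat sat", 2: "the cat saw the dog"}
postlist, positions = build_index(docs)
print(postlist["cat"])      # documents containing "cat"
print(positions[2]["the"])  # where "the" occurs in document 2
```

In this toy model, disabling positions corresponds to never building `positions`; `postlist` is still required to answer any query at all.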

Yes, this may be an issue. I'm seeing a few strange things happen:
- It doesn't look like the stemmer is doing anything. Just as one
example of many, surely "woman" and "women" should have the same stem?
- How can I have 's removed from the end of terms?
- Wikipedia has lots of words in other languages (with completely
different character sets). Is there a way of getting the indexer to
ignore terms with characters outside a given range?
- There are lots of things getting indexed which I would not have
expected to be indexed, such as numbers and number/string combinations.
- All terms which start with a letter seem to be duplicated as
Z-prefixed terms with the same frequency as the unprefixed term.
What's this for?
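
One way to approach the 's, character-range and number items above would be to filter tokens before they ever reach the indexer. The following is a minimal sketch in Python, not anything Xapian provides: the function name `clean_terms` and the "ASCII letters only" rule are assumptions for illustration, and the same logic would carry over to a PHP indexing script.

```python
import re

# Assumption: only plain ASCII letters are wanted; widen this pattern
# to whatever character range should be kept.
_ALLOWED = re.compile(r"[a-z]+")


def clean_terms(tokens):
    """Yield lowercased tokens that survive the filtering rules.

    Hypothetical pre-filter, applied before handing terms to the indexer.
    """
    for token in tokens:
        term = token.lower()
        # Strip a trailing possessive 's (straight or curly apostrophe).
        for suffix in ("'s", "\u2019s"):
            if term.endswith(suffix):
                term = term[: -len(suffix)]
                break
        # Drop numbers, letter/number mixes, non-ASCII and empty terms.
        if _ALLOWED.fullmatch(term):
            yield term


sample = ["Wikipedia's", "42", "abc123", "Москва", "women"]
print(list(clean_terms(sample)))
```

Filtering this way reduces the number of unique terms, at the cost of making the dropped numbers and non-English words unsearchable.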

I've had a read of the rest of your comments and they are very
interesting and informative. However, I'm going to hold off on the
other problems and possible solutions until I've managed to reduce
the number of terms being generated. Does that sound like a
sensible order?

