[Xapian-discuss] Indexing of email
Jim Lynch
jim at fayettedigital.com
Sun Aug 20 17:39:59 BST 2006
James Aylett wrote:
> On Mon, Aug 21, 2006 at 12:01:30AM +1000, Michael Daly wrote:
>
>
>> Does xapian index email as contained within an email (either linux
>> or windows) program? Please answer in regards to both the emails and
>> attachments.
>>
>
> Xapian itself is a library for building applications that need
> indexing and search facilities. I have a script which will create
> omega-compatible indexes from mbox format email collections, which
> isn't really ready for prime time but I'm happy to send to anyone who
> is interested. It's in python, under the GPL. Let me know if you want
> a copy.
>
> James
>
>
I also have a system to index email, but it's not even as far along as
James' script. Since I have multiple sources for my email, it's a bit
more complex than it would need to be for a single mailbox. It goes
something like this:
I have two different sets of directories with Unix mailbox files on DVD.
I ran hypermail (via a perl script to filter things) to convert them from
mbox format to html. This does two things: one, it provides me with a
file/directory tree of one email per file that I can easily index, and
two, it gives me a way to look at individual mail messages via a web
interface. Hypermail also adds attachments, but I filter out binary
attachments so the files aren't so horribly big.
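For readers without hypermail, the "one email per file, no binary attachments" idea can be sketched with Python's standard mailbox and email modules. This is a stand-in for illustration, not the hypermail/perl setup described above: it writes plain text rather than html, and the function names and the output file naming are made up for the example.

```python
import mailbox
import os


def text_body(msg):
    """Join the text/* parts of a message, skipping binary attachments."""
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # multipart containers and binary attachments
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name in the message; fall back to UTF-8.
            parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


def split_mbox(mbox_path, out_dir):
    """Write each message of an mbox file to its own text file.

    Returns the number of messages written. The 000001.txt naming
    scheme is an assumption made for this sketch.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    for count, msg in enumerate(mailbox.mbox(mbox_path), start=1):
        out_path = os.path.join(out_dir, "%06d.txt" % count)
        with open(out_path, "w", encoding="utf-8") as out:
            out.write("Subject: %s\n\n" % msg.get("Subject", ""))
            out.write(text_body(msg))
    return count
```

Calling split_mbox("archive.mbox", "out") on a hypothetical mbox file would leave one small text file per message in out/, ready to be picked up by an indexing step.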
Essentially I do the same thing for a set of Windows (Thunderbird)
folders that I also have archived mail stored in. And once a day I do
the same for a set of Linux Thunderbird mail folders. Hypermail reads
both formats fine, since they are both in "mbox" format or close enough.
Then to index them into the Xapian database, I use find to enumerate
all of the files in the html directories created by hypermail. This
list is fed into a perl script that looks for html files, doc files, pdf
files, etc. I then use an appropriate converter to convert these to
text, read them in and generate input for scriptindex. I collect a
number of sets of data for each file and then run scriptindex.
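The "generate input for scriptindex" step might look roughly like the sketch below, written in Python rather than perl. It only handles html files; doc and pdf files would need external converters, which are left out here. The field names url and text are assumptions and would have to match whatever index script is passed to scriptindex. As far as I know, scriptindex reads records of name=value lines separated by a blank line, with a newline inside a value continued by starting the next line with "=".

```python
import os
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Note: this also picks up the contents of <script>/<style>.
        self.chunks.append(data)


def html_to_text(html):
    """Strip tags and collapse runs of whitespace to single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def format_record(fields):
    """Format one record as name=value lines followed by a blank line."""
    lines = []
    for name, value in fields.items():
        # A newline inside a value is continued with a leading '='.
        lines.append(name + "=" + value.replace("\n", "\n="))
    return "\n".join(lines) + "\n\n"


def records_for_tree(root):
    """Yield one record per html file found under root."""
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if not name.lower().endswith((".html", ".htm")):
                continue  # other types need their own converter
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                text = html_to_text(f.read())
            yield format_record({"url": path, "text": text})
```

Writing "".join(records_for_tree("html")) to a file gives something that can be handed to scriptindex together with a matching index script.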
I actually have 3 different Xapian databases, so I can selectively
search the Irix set, the Windows set and the current Linux set. The
first two are static, but I re-index the current set every day. Since
it's not trivial to detect deleted mail messages, I just remove the
whole html set and start over each night. It takes a couple of hours,
but since I'm sleeping and the computer isn't doing anything useful
anyway, I don't care.
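A minimal sketch of that "throw it away and start over" job, again in Python: the conversion and indexing commands are passed in as argument lists, because the exact hypermail and scriptindex invocations are not given here. The function name and parameters are made up for the example.

```python
import os
import shutil
import subprocess


def rebuild(html_dir, commands):
    """Delete html_dir, recreate it empty, then run each command in order.

    commands is a list of argv lists (for example the conversion step,
    then the indexing step). A failing command raises CalledProcessError
    and stops the run.
    """
    shutil.rmtree(html_dir, ignore_errors=True)
    os.makedirs(html_dir)
    for argv in commands:
        subprocess.run(argv, check=True)
```

Run from cron each night, this trades a couple of hours of machine time for not having to track which messages were deleted.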
Someday I'll write up a simplified version of this and post it on the wiki.
Jim.