[Xapian-discuss] Newbie questions about omega

Julien Pfefferkorn julien.pfefferkorn at googlemail.com
Tue Apr 16 20:23:00 BST 2013


Dear all,

I have created two new Trac tickets:

For the 1st issue I mentioned:
http://trac.xapian.org/ticket/618

For the 3rd issue I mentioned:
http://trac.xapian.org/ticket/619

The 2nd issue seems to be already covered by:
http://trac.xapian.org/ticket/519

Regards

Julien


-----Original Message-----
From: Olly Betts [mailto:olly at survex.com]
Sent: Monday, 15 April 2013 08:53
To: Julien Pfefferkorn
Cc: Xapian Mailing list
Subject: Re: [Xapian-discuss] Newbie questions about omega

On Wed, Apr 03, 2013 at 03:16:50PM +0200, Julien Pfefferkorn wrote:
> I noticed that Omega indexes file names. The file name seems to be indexed as
> several words if the name contains space characters.
> 
> In my share I often separate words in the file name using "-" or "_", or even
> by using a capital letter at the beginning of each word (I guess this is also
> the case for many other users):
> 
> Examples:
> 
> this-is-a-file.txt
> 
> this_is_a_file.txt
> 
> thisIsAFile.txt
> 
> In those cases, I noticed that omega does not index the individual words,
> but only the full basename as one single word.

The last two are true, but you're incorrect about hyphens:

$ mkdir test
$ echo hello > test/this-is-a-test.txt
$ omindex --verbose --db tmp.db test
omindex: --url not specified, assuming '/'.
[Entering directory ""]
Indexing "this-is-a-test.txt" as text/plain ... added
$ delve -r1 tmp.db
Term List for record #1: D20130415 Etxt I* M201304 Oolly P/ Ttext/plain U/this-is-a-test.txt Y2013 Za Zhello Zis Ztest Zthis a hello is test this

> It would be helpful if omega indexed each individual word, to ease the
> search.

Currently the leafname is just handled the same way as text inside the
document.

We need to handle it the same way or else typing the leafname in as a search
wouldn't match the file in such cases, which would be confusing.  But we could
additionally index it split at punctuation and/or case transitions.  I'm not
sure exactly what the best algorithm would be though.
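
As an illustration only, here is one possible way to split a leafname at
punctuation and case transitions.  This is a minimal Python sketch, not
what Omega does: the function name split_leafname and the choices to drop
the extension and lowercase the words are assumptions made for the example.

```python
import os
import re

def split_leafname(leafname):
    """Return lowercase words found in a leafname (hypothetical helper)."""
    # Assumption: the extension is dropped before splitting.
    stem, _ext = os.path.splitext(leafname)
    words = []
    # First split at runs of anything that is not an ASCII letter or digit,
    # which covers "-", "_" and spaces.
    for part in re.split(r"[^A-Za-z0-9]+", stem):
        # Then split at case transitions: a run of capitals not followed by
        # a lowercase letter, an optionally capitalised word, or digits.
        words.extend(
            re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part)
        )
    return [w.lower() for w in words]

for name in ("this-is-a-file.txt", "this_is_a_file.txt", "thisIsAFile.txt"):
    print(name, split_leafname(name))
```

With this sketch all three example names yield the words this, is, a and
file.  Such words would be indexed in addition to the unsplit leafname, so
that searching for the full name still matches.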

> Is it planned to add that feature in omega? Should I write a feature request
> in trac?

Yes, that's the best way to make sure a suggestion doesn't get lost.

> It seems that omega does not index the file name if the MIME type cannot be
> indexed.
> 
> In order to be able to search all files by their name, it would be helpful
> if omega would index the file name in that case.

Yes, we don't index files unless we know how to.

You can make this happen for particular mimetypes with a dummy filter:

  --filter=application/octet-stream:/bin/true

But there's no way to tell it to do that for all unknown types currently.

> Is it planned to add this feature in omega? Should I write a feature request
> in trac?

Yes.

> It seems that omega does not currently index folder names.
> 
> In order to be able to search for a folder by its name, it would be helpful
> if omega would index it.
> 
> Is it planned to add this feature in omega? Should I write a feature request
> in trac?

Only indexing the leafname was a deliberate choice - the thinking was that
indexing the folder name for every file would make searches including a
word from the folder name very noisy, since every file in such a folder
would match.

It could probably be an optional feature, or perhaps it wouldn't actually
be problematic in practice.

Cheers,
    Olly

