[Xapian-discuss] Limitation of the terms size

Olly Betts olly at survex.com
Sun Mar 29 10:46:04 BST 2009


On Thu, Mar 26, 2009 at 03:52:10PM +0100, David Versmisse wrote:
> Thank you for your answers. With your suggestions, we made this:
> 
> from hashlib import sha1
>
> def _reduce_data(data):
>     # If the data are too long, we replace it by its sha1

You should probably decide if "data" is singular or plural!

>     if len(data) > 240:
>         if isinstance(data, unicode):
>             data = data.encode('utf-8')
>         return sha1(data).hexdigest()
>     # All OK, we simply return the data
>     return data
> 
> This function is called during indexing and searching, and it seems to
> work. The catch is that the "path" is our unique ID for each document.

Yes, that's a good approach.
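
For reference, here's a rough sketch of how the reduced value might be
used as the unique ID term (not your actual code; the "Q" prefix and
the database path are just illustrative):

    import xapian

    path = '/some/example/path.txt'
    db = xapian.WritableDatabase('index', xapian.DB_CREATE_OR_OPEN)

    def id_term(path):
        # "Q" is the conventional prefix for a unique ID term;
        # _reduce_data() above keeps the term short enough for Xapian.
        return 'Q' + _reduce_data(path)

    # Indexing: the same term both identifies and replaces the document.
    doc = xapian.Document()
    doc.set_data(path)
    doc.add_term(id_term(path))
    db.replace_document(id_term(path), doc)

    # Searching: regenerate exactly the same term from the path.
    enquire = xapian.Enquire(db)
    enquire.set_query(xapian.Query(id_term(path)))
    mset = enquire.get_mset(0, 1)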

It's probably not a problem for paths (if they always start with "/"),
but more generally, if a hashed value could also be a valid unhashed
value, you might want to ensure the two cases can't clash, or you lose
some of the benefit of using SHA1 here.
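
For instance, a variant of your function (just a sketch; the "!" marker
is arbitrary, anything a real path can't start with would do):

    from hashlib import sha1

    def _reduce_data(data):
        # Long values are replaced by "!" plus their SHA1 hex digest.
        # A real path always starts with "/", so a hashed value can
        # never be mistaken for an unhashed one.
        if len(data) > 240:
            if isinstance(data, unicode):
                data = data.encode('utf-8')
            return '!' + sha1(data).hexdigest()
        return data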

Cheers,
    Olly


