Given a document, how do you get its ID? (perl bindings)

Richard Boulton richard at tartarus.org
Mon May 9 21:26:27 BST 2016


Document does have a method for getting the numeric document ID:
Document::get_docid().  See
https://xapian.org/docs/apidoc/html/classXapian_1_1Document.html#a03ff36283ac7d14a1a3b1c9fb26eff61.
However, if you're using a URL as the unique ID, getting Xapian's internal
numeric docid isn't of much use.

Instead, to find out the document ID using the method described in the
UniqueIds document in the FAQ, you can look for a term beginning with a "Q"
in the document. You could do it with a function something like this (in
Python, and untested - I'm not up to date with the Perl bindings):

def get_id_string_from_doc(doc):
    termlist = doc.termlist()
    try:
        # skip_to() advances the iterator to the first term sorting at or
        # after "Q" and returns that item, so no separate next() call is
        # needed here.
        item = termlist.skip_to("Q")
    except StopIteration:
        raise KeyError("No ID in the document")
    term = item.term
    # The term found may belong to a later prefix, so check that it really
    # starts with "Q". (Under Python 3 the bindings return terms as bytes,
    # so compare against b"Q" there.)
    if not term.startswith("Q"):
        raise KeyError("No ID in the document")
    return term[1:]  # Remove the leading "Q" from the term
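
If you want to try the lookup logic without a Xapian database to hand, here
is a rough sketch of the same idea in plain Python. It does not use the
Xapian API at all: a sorted list of strings stands in for the document's
term list, bisect stands in for skip_to(), and the function name, the
example terms and the example.org URL are all made up for illustration. It
also assumes plain ASCII terms, so that Python's string ordering matches the
order Xapian would store the terms in.

```python
import bisect


def id_from_sorted_terms(terms, prefix="Q"):
    """Return the unique ID stored as a prefixed term.

    `terms` is a sorted list of strings standing in for a document's
    term list (hypothetical stand-in, not a Xapian object).
    """
    # Stand-in for skip_to(prefix): index of the first term that sorts
    # at or after the prefix.
    i = bisect.bisect_left(terms, prefix)
    # Either we ran off the end, or we landed on a term with a later prefix.
    if i == len(terms) or not terms[i].startswith(prefix):
        raise KeyError("No ID in the document")
    # Strip the prefix to recover the original ID.
    return terms[i][len(prefix):]


# Made-up terms: one "Q" ID term, one "T"-prefixed term, two plain terms.
terms = sorted(["example", "page", "Ttitle", "Qhttp://example.org/page"])
print(id_from_sorted_terms(terms))
```

The startswith() check matters because the first term at or after "Q" could
just as well be an "R..." or "T..." term if the document has no ID term.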

On Mon, May 9, 2016 at 6:12 PM Alex Aminoff <aminoff at nber.org> wrote:

> I am writing an indexer that will crawl our web site. Following the
> recommendation here:
>
> https://trac.xapian.org/wiki/FAQ/UniqueIds
>
> I'm using the URL as the unique ID for each document. I see how to get a
> document from the xapian database if I know its URL, but what I need is
> also to be able to find out the URL from the document. Does this mean I
> need to store the URL in a value in addition to as a term? In fact I
> notice that there is no get_id method on a document object, so even if
> you use numeric IDs assigned by Xapian you can not get them from a
> document.
>
>   - Alex

