
I want to assign an ID to every document in a Vespa cluster.

But I don't fully understand how visitors work in Vespa.

Can I have a shared field (meaning shared by all instances of my visitor) that I can atomically increment (using some lock) every time I visit a document?

What I tried obviously doesn't work, but it shows the general idea:

public class MyVisitor extends DocumentProcessor {

    // Where should I put this so that it is shared by all instances?
    private int document_id;

    private final Lock lock = new ReentrantLock();

    @Override
    public Progress process(Processing processing) {
        Iterator<DocumentOperation> it = processing.getDocumentOperations().iterator();
        while (it.hasNext()) {

            DocumentOperation op = it.next();
            if (op instanceof DocumentPut) {

                Document doc = ((DocumentPut) op).getDocument();
                /*
                 * Remove the PUT operation from the iterator so that it is not indexed back in
                 * the document cluster
                 */
                it.remove();

                // Acquire the lock before entering the try block, so that
                // unlock() only runs if lock() actually succeeded.
                lock.lock();
                try {
                    document_id += 1;
                } finally {
                    lock.unlock();
                }
            }
        }
        return Progress.DONE;
    }
}

Another idea is to get the number of buckets and the ID of the bucket I'm currently dealing with, and to increment using this pattern:

document_id = bucket_id
document_id += bucket_count

which would work (if I can ensure my visitor operates on a single bucket at a time), but I don't know how to get this information from my visitor.
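To make the arithmetic of that pattern concrete, here is a minimal sketch. It only illustrates the striding scheme; `StridedIdAllocator`, `partition` and `partitionCount` are hypothetical names standing in for the bucket ID and bucket count, and nothing here is a Vespa API.

```java
/**
 * Hands out partition, partition + partitionCount, partition + 2 * partitionCount, ...
 * Two allocators with different partition values (and the same partitionCount)
 * never produce the same ID.
 */
final class StridedIdAllocator {
    private final long stride;
    private long next;

    StridedIdAllocator(long partition, long partitionCount) {
        if (partition < 0 || partition >= partitionCount) {
            throw new IllegalArgumentException("partition must be in [0, partitionCount)");
        }
        this.next = partition;
        this.stride = partitionCount;
    }

    /** Returns the next ID for this partition; synchronized so threads can share one allocator. */
    synchronized long nextId() {
        long id = next;
        next += stride;
        return id;
    }
}
```

Uniqueness relies on each partition being served by exactly one allocator, which is the part I don't know how to guarantee from a visitor.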

Regis Portalez

1 Answer


Document processors operate on incoming document writes, so they cannot be applied to the result of visiting (not without a bit more setup anyway).

Instead, you can fetch all the documents over HTTP/2 with the visit operation of the /document/v1 API: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#visit

Then use the same API to issue an update operation for each document to set the field: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#put

Since this is done by a single process, you can then have a document_id counter which assigns unique values.
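A rough sketch of such a single-process client, using only the JDK's `java.net.http` client. The endpoint, namespace, document type and field name are placeholders, and the sketch assumes document IDs without `n=`/`g=` modifiers. The regex-based extraction is a shortcut to keep the example dependency-free; a real JSON parser would be the safer choice.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SequentialIdAssigner {
    // Placeholders (assumptions): adjust to your application.
    static final String ENDPOINT = "http://localhost:8080";
    static final String NAMESPACE = "mynamespace";
    static final String DOCTYPE = "mydoctype";
    static final String FIELD = "document_id";

    // Crude patterns for the visit response; not a substitute for a JSON parser.
    static final Pattern DOC_ID = Pattern.compile("\"id\"\\s*:\\s*\"(id:[^\"]+)\"");
    static final Pattern CONTINUATION = Pattern.compile("\"continuation\"\\s*:\\s*\"([^\"]+)\"");

    static List<String> extractDocumentIds(String json) {
        List<String> ids = new ArrayList<>();
        Matcher m = DOC_ID.matcher(json);
        while (m.find()) ids.add(m.group(1));
        return ids;
    }

    /** Returns the continuation token, or null when the visit is complete. */
    static String extractContinuation(String json) {
        Matcher m = CONTINUATION.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8).replace("+", "%20");
    }

    static String visitUrl(String continuation) {
        String base = ENDPOINT + "/document/v1/" + NAMESPACE + "/" + DOCTYPE + "/docid";
        return continuation == null ? base : base + "?continuation=" + encode(continuation);
    }

    /** Maps "id:namespace:doctype::user-part" to its /document/v1 URL. */
    static String documentUrl(String fullId) {
        String userPart = fullId.substring(fullId.indexOf("::") + 2);
        return ENDPOINT + "/document/v1/" + NAMESPACE + "/" + DOCTYPE + "/docid/" + encode(userPart);
    }

    /** Partial update body that assigns the counter value to FIELD. */
    static String updateBody(long value) {
        return "{\"fields\":{\"" + FIELD + "\":{\"assign\":" + value + "}}}";
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        long counter = 0; // single process, single thread: no locking needed
        String continuation = null;
        do {
            HttpRequest visit = HttpRequest.newBuilder(URI.create(visitUrl(continuation))).GET().build();
            String page = client.send(visit, HttpResponse.BodyHandlers.ofString()).body();
            for (String id : extractDocumentIds(page)) {
                HttpRequest update = HttpRequest.newBuilder(URI.create(documentUrl(id)))
                        .header("Content-Type", "application/json")
                        .PUT(HttpRequest.BodyPublishers.ofString(updateBody(counter++)))
                        .build();
                client.send(update, HttpResponse.BodyHandlers.ofString());
            }
            continuation = extractContinuation(page);
        } while (continuation != null);
    }
}
```

The sketch does no error handling or retries, and it sends updates one at a time; it is meant to show the shape of the loop rather than a production feeder.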

As an aside, a common trick to avoid that requirement is to generate a UUID for each document.
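For illustration, both variants are available in the JDK. The class and method names below are made up for this sketch; `UUID.randomUUID` and `UUID.nameUUIDFromBytes` are standard `java.util.UUID` methods.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

final class UuidIds {
    /** Random (version 4) UUID: no coordination between processes required. */
    static String randomId() {
        return UUID.randomUUID().toString();
    }

    /** Name-based (version 3) UUID: the same input string always yields the same UUID. */
    static String stableId(String vespaDocumentId) {
        byte[] name = vespaDocumentId.getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(name).toString();
    }
}
```

The random variant can be generated by any number of writers in parallel; the name-based variant is reproducible if the value ever has to be recomputed from the document ID.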

Jon
  • Thanks. I edited with another idea I have. – Regis Portalez Mar 01 '22 at 12:33
  • There is no guarantee that documents will be returned from only one bucket at a time, and you would need to ensure that you use additional bits, not the same bits, for your counter. Unless this is personal data where a mix-up would be a disaster, you can just hash the document ID to some number if you really want a number and not a UUID. – Jon Mar 01 '22 at 12:53
  • Thanks @jon. It seems difficult, then, to assign an integer to each document with multiple processes and without post-processing. – Regis Portalez Mar 01 '22 at 13:18
  • Yes, that is a fundamental problem in computer science (or, really, physics). The common solution, if you really need it, is to use timestamps from on-board atomic clocks synchronized by satellite. – Jon Mar 01 '22 at 18:05
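The hashing suggestion from the comments could look roughly like this. It is a sketch under assumptions: SHA-256 and the 64-bit truncation are choices made here, not something the comment specifies, and `HashedIds` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HashedIds {
    /** Derives a 64-bit number from a document ID: the first 8 bytes of its SHA-256 digest. */
    static long numericId(String documentId) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(documentId.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong(); // big-endian, may be negative
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

Any process can compute the number independently, with no shared counter. Collisions are possible in principle, which is why the comment rules this out for data where a mix-up would be a disaster.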