
We have noticed excessive logging from the TrackingEventProcessor class when scaling the micro-service up to 2 replicas.

Our Axon setup:

  • Axon version 3.4.3
  • Spring Boot version 2.1.6.RELEASE
  • an in-house Couchbase implementation of the TokenStore, i.e. CouchBaseTokenStore
  • PostgreSQL v11.7 as the event store
  • Segment count for each tracking event processor = 1
  • Tracking Event Processors are configured for single-threaded processing via TrackingEventProcessorConfiguration.forSingleThreadedProcessing() (see the registration sketch after this list)
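
For reference, the processors themselves are registered roughly like the sketch below. This is a minimal sketch assuming the Axon 3.x configuration API; the class name AxonProcessorConfig is illustrative, and the exact registerTrackingProcessor overload should be double-checked against 3.4.3.

import org.axonframework.config.EventHandlingConfiguration;
import org.axonframework.eventhandling.TrackingEventProcessorConfiguration;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Configuration;

@Configuration
public class AxonProcessorConfig {

    @Autowired
    public void configureProcessors(EventHandlingConfiguration eventHandlingConfiguration) {
        // One segment, single-threaded, matching the setup listed above.
        eventHandlingConfiguration.registerTrackingProcessor(
                "core_policy-view",
                c -> TrackingEventProcessorConfiguration.forSingleThreadedProcessing()
                        .andInitialSegmentsCount(1));
    }
}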

We are seeing the following messages a lot:

Segment is owned by another node. Releasing thread to process another segment...
No Worker Launcher active. Using current thread to assign segments.
Using current Thread for last segment worker: TrackingSegmentWorker{processor=core_policy-view, segment=Segment[0/0]}
Fetched token: GapAwareTrackingToken{index=2369007, gaps=[]} for segment: Segment[0/0]

Here is an example for a single processing group called core_policy-view:

[screenshot: repeated claim/release log output for the core_policy-view processing group]

This is from our staging environment. We don't see the same behaviour in our Production environment, which runs even more replicas of the same micro-service (5 instead of 2), albeit a previous version of it. The excessive logging is a concern because it causes the following query to be executed just as excessively against the event store:

SELECT min ( globalIndex ) FROM DomainEventEntry WHERE globalIndex > ?

Here is a graph of how it impacted our APM metrics when the micro-service was scaled up to 2 replicas:

[APM graph: spike in event-store query load after scaling to 2 replicas]

Looking at the code in the TrackingEventProcessor class, I understand that the TrackingEventProcessor is responsible for assigning segments to the running instance. If a segment is already claimed, a given instance should not keep trying to claim it, yet that does not seem to be happening. Sometimes the segment/token even ping-pongs between the two nodes.
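
To make the flow concrete, below is a heavily condensed sketch of what a worker iteration boils down to as we read the Axon 3.4 sources. It is not the actual implementation; the workerIteration method and the batch-handling comment are purely illustrative, while TokenStore, TrackingToken and UnableToClaimTokenException are the real Axon types.

// Condensed sketch of one worker iteration; NOT the actual Axon implementation.
void workerIteration(TokenStore tokenStore) {
    try {
        // fetchToken(...) is expected to claim the segment for this node,
        // or throw if another node still holds a valid claim.
        TrackingToken token = tokenStore.fetchToken("core_policy-view", 0);
        // ... read a batch of events from the token position and handle them ...
        tokenStore.storeToken(token, "core_policy-view", 0);
    } catch (UnableToClaimTokenException e) {
        // This is what produces:
        // "Segment is owned by another node. Releasing thread to process another segment..."
        // after which the worker backs off before trying again.
    }
}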

At the moment I may not even be articulating the issue correctly, and I'm also unsure what questions to ask. I would appreciate it if anyone can shed some light on whether we are doing something wrong. I know the version of Axon we are using is old, and we have the upgrade on the roadmap, but for now we need to get the current changes of the system into Production so that we can move forward and start the upgrade after that.

EDIT 1

Here are the methods in CouchBaseTokenStore that are responsible for claiming the token:

@Override
public void storeToken(TrackingToken token, String processorName, int segment) throws UnableToClaimTokenException {
    // Consider tracking CAS to avoid re-read and harness the CAS optimistic concurrency better.
    // Unfortunately, GapAwareTrackingToken is expected by RDBMS storage engines and can't be extended.
    JsonDocument doc = readOrCreateDocument(processorName, segment);
    if (GapAwareTrackingToken.class.isAssignableFrom(token.getClass())) {
        writeGapAwareToken((GapAwareTrackingToken) token, doc);
    } else {
        writeTrackingToken(token, doc);
    }
    this.axonStateBucket.upsert(doc);
}

@Override
public TrackingToken fetchToken(String processorName, int segment) throws UnableToClaimTokenException {
    JsonDocument doc = readOrCreateDocument(processorName, segment);
    String tokenClass = doc.content().getString(TOKEN_CLASS_FIELD);
    if (tokenClass == null) {
        return readGapAwareToken(doc);
    } else {
        return readTrackingToken(doc);
    }
}

private JsonDocument readOrCreateDocument(String processorName, int segment) throws UnableToClaimTokenException {
    String docId = getId(processorName, segment);
    JsonDocument doc = this.axonStateBucket.get(docId);
    if (doc == null) {
        try {
            doc = createDocument(processorName, segment);
        } catch (DocumentAlreadyExistsException e) {
            // Another instance beat us to it, read new token
            // which will most likely not be claimable.
            doc = this.axonStateBucket.get(docId);
        }
    }
    claimToken(doc);
    return doc;
}

private JsonDocument createDocument(String processorName, int segment) throws DocumentAlreadyExistsException {
    JsonObject content = JsonObject.create()
        .put(PROCESSOR_NAME_FIELD, processorName)
        .put(SEGMENT_FIELD, segment)
        .put(TYPE_FIELD, TOKEN_TYPE)
        .put(CLAIM_EXPIRY_FIELD, formatInstant(Instant.now().plus(claimDuration)))
        .put(OWNER_FIELD, nodeName);
    JsonDocument doc = JsonDocument.create(getId(processorName, segment), content);
    return this.axonStateBucket.insert(doc);
}

private void claimToken(JsonDocument document) throws UnableToClaimTokenException {
    String originalOwner = document.content().getString(OWNER_FIELD);
    Instant originalClaimExpiry = DateTimeUtils.parseInstant(document.content().getString(CLAIM_EXPIRY_FIELD));
    document.content()
        .put(CLAIM_EXPIRY_FIELD, formatInstant(Instant.now().plus(claimDuration)))
        .put(OWNER_FIELD, nodeName);
    if (nodeName.equals(originalOwner)) return;
    if (originalClaimExpiry.isAfter(clock.instant())) {
        throw new UnableToClaimTokenException(String.format("Claim for owner %s is still valid.", originalOwner));
    }
}


HeatZync
  • The responsibility for allowing a segment to be claimed is in the TokenStore implementation. Could be that the CouchBaseTokenStore implementation you created doesn't verify whether a token is already claimed before assigning it. Can you share the code (or relevant snippets) in the question? – Allard Apr 08 '21 at 13:29
  • @Allard: The CouchBaseTokenStore implementation does verify and either allow or deny (by throwing an UnableToClaimTokenException). I will however edit the question and add the code for completeness. – HeatZync Apr 08 '21 at 13:58
  • @Allard: I've edited the question with the code. The log messages we are seeing are the result of the UnableToClaimTokenException being thrown. – HeatZync Apr 08 '21 at 14:17
  • I am missing the persistence of the claim owner and timestamp in the claim method. The fields are updated in the document, but I don't see the document being updated anywhere. Right now, it seems you can run into conflicts where 2 nodes simultaneously claim a Segment and only fail once they try to update the token. – Allard Apr 08 '21 at 14:30
  • The insert happens in the createDocument() method. There is also an upsert that happens in the overridden storeToken() method. I have added both methods to the code block in the question. – HeatZync Apr 09 '21 at 06:58
  • I think there is a concurrency issue in the Token Store implementation. When updating the document, you must ensure that the document hasn't been updated by another process in the meantime. Otherwise, claims can be accidentally "stolen" by another process. The upsert, of which I don't see the code above, should use a CAS operation to ensure atomic compare-and-update semantics. – Allard Apr 12 '21 at 12:20
  • Thanks Allard. We have in the meantime switched from the CouchBaseTokenStore to the Axon-provided JdbcTokenStore (backed by PostgreSQL), and the excessive token claiming behaviour immediately stopped. When you mentioned CAS, I also read the Couchbase SDK documentation, and the upsert(Document) method that is called in the above storeToken() method mentions: "Please note that this method will not use the Document.cas() for optimistic concurrency checks. If this behavior is needed, the replace(Document) method needs to be used." So, further code changes might be in order. (A rough bean sketch of the JdbcTokenStore switch follows these comments.) – HeatZync Apr 12 '21 at 14:16
  • 2
    @Allard: You hit the nail on the head with your 14:30 comment on 8 April 2021. The claim wasn't being persisted in the `fetchToken()` method. I will answer the question below showing the code changes that were needed (including the `replace()` instead of the `upsert() for good measure). – HeatZync Apr 15 '21 at 19:48
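
For completeness, the JdbcTokenStore switch mentioned in the comments above came down to a bean definition roughly like the sketch below. This is a minimal sketch against Axon 3.4 and Spring Boot; the class name TokenStoreConfig is illustrative, and the exact constructor and helper signatures should be verified against your Axon version.

import javax.sql.DataSource;

import org.axonframework.eventhandling.tokenstore.TokenStore;
import org.axonframework.eventhandling.tokenstore.jdbc.JdbcTokenStore;
import org.axonframework.serialization.Serializer;
import org.axonframework.spring.jdbc.SpringDataSourceConnectionProvider;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TokenStoreConfig {

    // Replaces the CouchBaseTokenStore; the Axon token table must already exist in PostgreSQL.
    @Bean
    public TokenStore tokenStore(DataSource dataSource, Serializer serializer) {
        return new JdbcTokenStore(new SpringDataSourceConnectionProvider(dataSource), serializer);
    }
}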

1 Answer


I managed to fix the problem with the help of Allard (see the comments on the question). The fix was to also persist the token after it has been claimed in the fetchToken() method. We also started making use of the replace() method supplied by the Couchbase SDK instead of the upsert() method, to better harness CAS (Compare-and-Swap) optimistic concurrency:

@Override
public void storeToken(TrackingToken token, String processorName, int segment) throws UnableToClaimTokenException {
    JsonDocument doc = readOrCreateDocument(processorName, segment);
    if (GapAwareTrackingToken.class.isAssignableFrom(token.getClass())) {
        writeGapAwareToken((GapAwareTrackingToken) token, doc);
    } else {
        writeTrackingToken(token, doc);
    }
    axonStateBucket.replace(doc);
}

@Override
public TrackingToken fetchToken(String processorName, int segment) throws UnableToClaimTokenException {
    JsonDocument doc = readOrCreateDocument(processorName, segment);

    axonStateBucket.replace(doc); // readOrCreateDocument method claims, so we need to persist that

    String tokenClass = doc.content().getString(TOKEN_CLASS_FIELD);

    if (tokenClass == null) {
        return readGapAwareToken(doc);
    } else {
        return readTrackingToken(doc);
    }
}

The rest of the code remained the same as per the code block in the question.
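
One follow-up worth noting: with the Couchbase Java SDK 2.x, replace(doc) only performs an optimistic check when the document carries the CAS value populated by the preceding get(), and a lost race surfaces as a CASMismatchException. Below is a minimal sketch of translating that into Axon's claim semantics; the persistClaim helper is an assumption on our part and not part of the code above.

import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.error.CASMismatchException;
import org.axonframework.eventhandling.tokenstore.UnableToClaimTokenException;

// Illustrative fragment of the token store class (persistClaim is not part of the code above).
private void persistClaim(JsonDocument doc) throws UnableToClaimTokenException {
    try {
        // replace() uses doc.cas() (populated by the earlier get()) for its optimistic check.
        this.axonStateBucket.replace(doc);
    } catch (CASMismatchException e) {
        // Another node changed the document between our get() and replace(): treat the claim as lost.
        throw new UnableToClaimTokenException("Token claim lost to a concurrent update by another node.");
    }
}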

HeatZync