Uber Cadence task list management

Question

We are using Uber Cadence and periodically we run into issues on the production environment. The setup is the following:

One Java 14 BE with Cadence client 2.7.5
Cadence service version 0.14.1 with Postgres DB

There are multiple domains, for all domains the single BE server is registered as a worker.

What is visible in the logs is that sometimes during a query the cadence seems to lose stickiness to the BE service:

"msg":"query direct through matching failed on sticky, clearing sticky before attempting on non-sticky","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
...

In the backend in the meanwhile nothing is visible. However, during this time if I check the pollers on the cadence web client I see that the task list is there, but it is not considered as a decision handler any more (http://localhost:8088/domains/mydomain/task-lists/mytasklist/pollers). Because of this pretty much the whole environment is dead because there is nothing that can progress with the decision. The only option is to restart the backend service and let it re-register as a worker.

At this point the investigation is stuck, so some help would be appreciated.

Does anyone know about how a worker or task list can lose its ability to be a decision handler? Is it managed by cadence, like based on how many errors the worker generates? I was not able to find anything about this.
As I understand when the stickiness is lost, cadence will check for another worker to replay the workflow and continue it (in my case this will be the same worker as there is only one). Is it possible that replaying the flow is not possible (although I think it would generate something in the backend log from the cadence client) or at that point the worker is already removed from the list and that causes the time-out?

Any help would be more than welcome! Thanks!

score 1 · Answer 1 · answered Apr 21 '21 at 20:27

Does anyone know about how a worker or task list can lose its ability to be a decision handler

This will happen when worker stops polling for decision tasks. For example if you configure the worker only polls for activity tasks, then it will show like that. So apparently it will also happen if for some reason the worker stops polling for decision tasks.

As I understand when the stickiness is lost, cadence will check for another worker to replay the workflow and continue

Yes, as long as there is another worker polling for decision tasks. Note that Query tasks is considered as of the the decision task types. (this is a wrong design, we are working on to separate it).

From your logs:

"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"

This means that Cadence dispatch the Query tasks to a worker, and a worker accepted, but didn't respond back within timeout.

It's very highly possible that there is some bugs in your Query handler logic. The bug caused decision worker to crash(which means Cadence java client also has a bug too, user code crashing shouldn't crash worker). And then a query task loop over all instances of your worker pool, finally crashed all your decision workers.

Uber Cadence task list management

1 Answers1