We are using Uber Cadence and periodically we run into issues on the production environment. The setup is the following:
- One Java 14 BE with Cadence client 2.7.5
- Cadence service version 0.14.1 with Postgres DB
There are multiple domains, for all domains the single BE server is registered as a worker.
What is visible in the logs is that sometimes during a query the cadence seems to lose stickiness to the BE service:
"msg":"query direct through matching failed on sticky, clearing sticky before attempting on non-sticky","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
...
In the backend in the meanwhile nothing is visible. However, during this time if I check the pollers on the cadence web client I see that the task list is there, but it is not considered as a decision handler any more (http://localhost:8088/domains/mydomain/task-lists/mytasklist/pollers). Because of this pretty much the whole environment is dead because there is nothing that can progress with the decision. The only option is to restart the backend service and let it re-register as a worker.
At this point the investigation is stuck, so some help would be appreciated.
- Does anyone know about how a worker or task list can lose its ability to be a decision handler? Is it managed by cadence, like based on how many errors the worker generates? I was not able to find anything about this.
- As I understand when the stickiness is lost, cadence will check for another worker to replay the workflow and continue it (in my case this will be the same worker as there is only one). Is it possible that replaying the flow is not possible (although I think it would generate something in the backend log from the cadence client) or at that point the worker is already removed from the list and that causes the time-out?
Any help would be more than welcome! Thanks!