
The other day we ran into some issues with our Cadence setup. One of our machine instances saw its CPU usage climb to 90%, and all inbound workflow executions were stuck in the "Scheduled" state. After checking the logs, we noticed that the matching service was throwing the following error:

{
  "level": "error",
  "ts": "2021-03-20T14:41:55.130Z",
  "msg": "Operation failed with internal error.",
  "service": "cadence-matching",
  "error": "InternalServiceError{Message: UpdateTaskList operation failed. Error: gocql: no hosts available in the pool}",
  "metric-scope": 34,
  "logging-call-at": "persistenceMetricClients.go:872",
  "stacktrace": "github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/common/persistence.(*taskPersistenceClient).updateErrorMetric\n\t/cadence/common/persistence/persistenceMetricClients.go:872\ngithub.com/uber/cadence/common/persistence.(*taskPersistenceClient).UpdateTaskList\n\t/cadence/common/persistence/persistenceMetricClients.go:855\ngithub.com/uber/cadence/service/matching.(*taskListDB).UpdateState\n\t/cadence/service/matching/db.go:103\ngithub.com/uber/cadence/service/matching.(*taskReader).persistAckLevel\n\t/cadence/service/matching/taskReader.go:277\ngithub.com/uber/cadence/service/matching.(*taskReader).getTasksPump\n\t/cadence/service/matching/taskReader.go:156"
}

After restarting the workflow, everything went back to normal, but we're still trying to figure out what happened. We weren't running any heavy workload at the moment of the event; it just happened suddenly. Our main suspicion is that the matching service lost connectivity with the Cassandra database during this event and was only able to reconnect after we restarted it, but this is just a hypothesis at this point.

What might have caused this problem, and is there a way to prevent it from happening in the future? Maybe some dynamic config that we're missing?

PS: Cadence version is 0.18.3

1 Answer


This is a known issue in gocql that can be caused by a number of things:

  1. Cassandra is overloaded and some nodes are not responsive. You may think your load is small, but the best way to check is through the Cadence metrics/dashboards; there is a section about persistence metrics.
  2. If 1. is the problem, you can tune the rate limiting to protect your Cassandra cluster: matching.persistenceGlobalMaxQPS acts as a global rate limiter and overrides matching.persistenceMaxQPS (see the config sketch after this list).
  3. A network issue or a bug in gocql. It's really frustrating. We recently decided to do the refreshing in this PR; hopefully this will be mitigated in the next release.
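
If you go the rate-limiting route, here is a minimal sketch of what the file-based dynamic config entries could look like, assuming your cluster loads dynamic config from a YAML file; the file path and the value of 3000 are placeholders, not recommendations:

# dynamicconfig/development.yaml (example path; use whatever file your
# dynamic config client is pointed at)
matching.persistenceGlobalMaxQPS:
  - value: 3000          # global cap shared across all matching hosts
    constraints: {}
matching.persistenceMaxQPS:
  - value: 3000          # per-host cap; overridden when the global limit above is set
    constraints: {}

Start from the persistence QPS you actually see on the dashboard and lower the limit gradually, so you protect Cassandra without throttling healthy traffic.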

Also, if a matching node is running hot, you are probably hitting the limit of a single tasklist. If so, consider enabling the scalable tasklist feature (see the sketch below).
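
As a rough illustration, the scalable tasklist feature is also driven by dynamic config. The sketch below assumes the matching.numTasklistWritePartitions and matching.numTasklistReadPartitions keys apply to your version; the tasklist name "my-tasklist" and partition count of 4 are placeholders:

matching.numTasklistWritePartitions:
  - value: 4
    constraints:
      taskListName: "my-tasklist"
matching.numTasklistReadPartitions:
  - value: 4
    constraints:
      taskListName: "my-tasklist"

Keep the read partition count at least as large as the write partition count, so every partition that receives tasks is also being polled.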

Long Quanzheng