
We have encountered a longstanding issue with Cadence and Cassandra. When we request a list of closed workflows (with close status "closed" or "completed") via Cadence-web or the Cadence CLI (through cadence-frontend), one or two nodes of the Cassandra cluster go down. This suggests that they cannot handle such a request and become inoperative.
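For reference, the kind of request that triggers this can be issued from the Cadence CLI roughly as follows (the domain name is a placeholder, and flag names may differ slightly between CLI versions):

```sh
# List closed workflow executions in a domain. Paging through all matching
# executions (e.g. with the "listall" variant) is the expensive case here.
# "our-domain" is a placeholder; check `cadence workflow list --help`
# for the exact flags in your CLI version.
cadence --domain our-domain workflow list

# Filter by close status, e.g. only completed executions:
cadence --domain our-domain workflow list --status completed
```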

Consequently, the entire Cadence + Cassandra deployment halts: new workflows cannot be created, and previously started ones stop being processed.

Has anyone experienced a similar issue in the past and found a solution? What can we try to understand and rectify the cause of this problem?

Versions in use: Cadence 0.11.0, cadence-web 3.11, Cassandra 3.11.5; the Cassandra cluster has 5 nodes with a replication factor of 3.

Approximately 13 million closed workflows are recorded over a retention period of 5 days.

A similar search over a smaller amount of data (workflows) does not cause such problems.

  • Without a stack trace or other information such as your read queries, table schema, etc., this is hard to triage. It also appears that your cluster is not sized properly to handle the increased load you describe in the question. Have you done [capacity sizing and cluster testing](https://docs.datastax.com/en/dseplanning/docs/planning-testing.html) before putting this load on the cluster? If so, could you share the results? – Madhavan Aug 29 '23 at 12:31
  • Since this is a developer platform, I'd recommend you move this post to [Stack Overflow](https://stackoverflow.com/questions/tagged/cassandra) instead. – Madhavan Aug 29 '23 at 12:33
  • Thanks for the reply. The Cassandra cluster itself and Cadence have been working very well for more than 3 years without any failures. But problems arise precisely when we request a list of closed workflows. We suspect the volume of the requested data, but we are not completely sure. – Anush Chinoyan Aug 29 '23 at 14:01
  • As for testing this, we have no way to do so yet, but we will try in the future. – Anush Chinoyan Aug 29 '23 at 14:01
  • If you don't want to manage or perform sizing exercises as your workload randomly increases or decreases, you could try the serverless database-as-a-service platform [here](https://www.datastax.com/astra), which also has [efficient indexing](https://docs.datastax.com/en/cql/astra/docs/developing/indexing/sai/sai-overview.html). – Madhavan Aug 30 '23 at 18:44

1 Answer


I'm pretty sure that your Cassandra nodes are hitting an out-of-memory error.
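One quick way to check that assumption (the log paths below are common defaults and may differ on your installation):

```sh
# Look for JVM out-of-memory errors in the Cassandra system log.
# /var/log/cassandra is a typical default location; adjust as needed.
grep -i "OutOfMemoryError" /var/log/cassandra/system.log

# Also check whether the Linux OOM killer terminated the process.
dmesg | grep -i "killed process"
```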

The solution is not to use Cassandra for indexing large numbers of workflows, because Cassandra secondary indexes are really broken for this kind of query. That's why both Cadence and Temporal provide Elasticsearch integration for visibility.
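If you go that route, advanced visibility is switched on in the Cadence server's persistence configuration. A minimal sketch, assuming a local Elasticsearch instance; the scheme, host, and index name are placeholders, so compare against the `development_es.yaml` template shipped with your Cadence release:

```yaml
# Sketch: point Cadence's advanced visibility store at Elasticsearch.
# All values below are placeholders for illustration.
persistence:
  advancedVisibilityStore: es-visibility
  datastores:
    es-visibility:
      elasticsearch:
        url:
          scheme: "http"
          host: "127.0.0.1:9200"
        indices:
          visibility: cadence-visibility
```

Note that Cadence writes visibility records to Elasticsearch through Kafka, so a matching `kafka:` section and the dynamic-config switch for advanced visibility writing are also required; the es-visibility templates in the Cadence repository show the full wiring.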

Maxim Fateev