
I’m trying to run each Cadence service independently so that I can scale them in and out easily. My team is using Docker Swarm, and we’re managing everything with a Portainer UI. So far, I’ve been able to scale the frontend service to two replicas, but if I do the same with the matching service, I get a lot of DecisionTaskTimedOut events during workflow execution. Eventually the execution finishes successfully, but only after a long time. To give an idea, it takes about 2 minutes with two matching service replicas, while it takes only 7 seconds with just one.

This is a test environment. I’m using a dockerized Cassandra DB (we cannot use a real cluster due to budget restrictions). Maybe that’s the problem? The Docker image is configured with the following environment variables:

RINGPOP_BOOTSTRAP_MODE=dns
KEYSPACE=cadence
BIND_ON_IP=0.0.0.0
SKIP_SCHEMA_SETUP=false
VISIBILITY_KEYSPACE=cadence_visibility
CASSANDRA_HOSTNAME=soap_cassandra
RINGPOP_SEEDS=soap_cadence_frontend:7933,soap_cadence_history:7934,soap_cadence_worker:7939
CADENCE_HOME=/etc/cadence
SERVICES=matching

You can assume the default values for any other env vars not shown above.

The RINGPOP_SEEDS are the service names assigned to each Cadence service; Docker Swarm creates a DNS entry for each of them, as well as a load balancer when more than one replica is declared.
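
To make the setup concrete, here is a simplified sketch of how the matching service is declared in the Swarm stack (the compose version and network name here are illustrative, not taken from our real file; the other Cadence services follow the same pattern):

# Simplified sketch of the Swarm stack definition for the matching service
version: "3.7"
services:
  soap_cadence_matching:
    image: ubercadence/server:0.15.1
    environment:
      - SERVICES=matching
      - RINGPOP_BOOTSTRAP_MODE=dns
      - RINGPOP_SEEDS=soap_cadence_frontend:7933,soap_cadence_history:7934,soap_cadence_worker:7939
      - BIND_ON_IP=0.0.0.0
      - CASSANDRA_HOSTNAME=soap_cassandra
      - KEYSPACE=cadence
      - VISIBILITY_KEYSPACE=cadence_visibility
      - SKIP_SCHEMA_SETUP=false
      - CADENCE_HOME=/etc/cadence
    deploy:
      replicas: 2   # scaling this to 2 is what triggers the DecisionTaskTimedOut behaviour
    networks:
      - cadence_net
networks:
  cadence_net:
    driver: overlay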

The matching service seems to start correctly. Logs:

{"level":"info","ts":"2021-02-18T22:47:36.296Z","msg":"Created RPC dispatcher and listening","service":"cadence-matching","address":"0.0.0.0:7935","logging-call-at":"rpc.go:81"},
{"level":"warn","ts":"2021-02-18T22:47:36.321Z","msg":"Failed to fetch key from dynamic config","key":"system.advancedVisibilityWritingMode","error":"unable to find key","logging-call-at":"config.go:68"},
{"level":"info","ts":"2021-02-18T22:47:36.336Z","msg":"Add new peers by DNS lookup","address":"0.0.0.0","addresses":"[0.0.0.0:7933]","logging-call-at":"clientBean.go:321"},
{"level":"info","ts":"2021-02-18T22:47:36.321Z","msg":"Creating RPC dispatcher outbound","service":"cadence-frontend","address":"0.0.0.0:7933","logging-call-at":"clientBean.go:277"},
{"level":"info","ts":"2021-02-18T22:47:36.441Z","msg":"Starting service matching","logging-call-at":"server.go:217"},
{"level":"warn","ts":"2021-02-18T22:47:36.441Z","msg":"Failed to fetch key from dynamic config","key":"matching.throttledLogRPS","error":"unable to find key","logging-call-at":"config.go:68"},
{"level":"info","ts":"2021-02-18T22:47:36.441Z","msg":"Creating RPC dispatcher outbound","service":"cadence-frontend","address":"127.0.0.1:7933","logging-call-at":"clientBean.go:277"},
{"level":"info","ts":"2021-02-18T22:47:36.442Z","msg":"Add new peers by DNS lookup","address":"127.0.0.1","addresses":"[127.0.0.1:7933]","logging-call-at":"clientBean.go:321"},
{"level":"info","ts":"2021-02-18T22:47:36.713Z","msg":"matching starting","service":"cadence-matching","logging-call-at":"service.go:90"},
{"level":"info","ts":"2021-02-18T22:47:36.734Z","msg":"RuntimeMetricsReporter started","service":"cadence-matching","logging-call-at":"runtime.go:169"},
{"level":"info","ts":"2021-02-18T22:47:36.734Z","msg":"PProf not started due to port not set","logging-call-at":"pprof.go:64"},
{"level":"info","ts":"2021-02-18T22:47:36.799Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-matching","addresses":"[[::]:7935]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.799Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-worker","addresses":"[[::]:7939]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.800Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-frontend","addresses":"[[::]:7933]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.814Z","msg":"service started","service":"cadence-matching","logging-call-at":"resourceImpl.go:383"},
{"level":"info","ts":"2021-02-18T22:47:36.814Z","msg":"matching started","service":"cadence-matching","logging-call-at":"service.go:99"}

I can see the following errors in the logs when the workflow is executing:

{"level":"error","ts":"2021-02-18T22:17:07.281Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"ae85d0ac1629:f8102a0f-406a-4fc7-8abf-e4b3fd66a278","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: ae85d0ac1629:f8102a0f-406a-4fc7-8abf-e4b3fd66a278, taskListType: 0, rangeID: 14, db rangeID: 15","wf-task-list-name":"ae85d0ac1629:f8102a0f-406a-4fc7-8abf-e4b3fd66a278","wf-task-list-type":0,"number":1300001,"next-number":1300001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:52:03.740Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: 8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca, taskListType: 0, rangeID: 16, db rangeID: 17","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"number":1500002,"next-number":1500002,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:10:10.971Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"FeaTaskList","wf-task-list-type":1,"store-operation":"create-task","error":"Failed to create task. TaskList: FeaTaskList, taskListType: 1, rangeID: 94, db rangeID: 95","wf-task-list-name":"FeaTaskList","wf-task-list-type":1,"number":9300001,"next-number":9300001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:09:53.345Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: 8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca, taskListType: 0, rangeID: 14, db rangeID: 15","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"number":1300001,"next-number":1300001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:53:56.145Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: 8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca, taskListType: 0, rangeID: 17, db rangeID: 18","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"number":1600001,"next-number":1600001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"}

The docker image version I'm currently using is: ubercadence/server:0.15.1

Is there any way to resolve this issue?

1 Answer


My best guess is that the problem is BIND_ON_IP=0.0.0.0. Each instance should use a unique hostIP:port as its address. Because they all advertise 0.0.0.0, every service will only work correctly when running with a single instance; more than one instance will conflict.

However, this is not a problem for the frontend service because FE is stateless. Matching/History will run into this problem:

HostA registers itself with the matching ring as 0.0.0.0:7935, and then HostB tries to do the same. This makes the consistent hashing ring unstable, and tasklist ownership keeps switching between HostA and HostB.

To resolve this issue, you need to let each instance use its own host IP, like the pod IP in K8s.
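
I'm not very familiar with Docker Swarm, but one possible way (just a sketch, not tested; the entrypoint script path is a placeholder you would need to check against the image) is to resolve the container's own overlay-network IP at startup and export it as BIND_ON_IP before starting the server:

# Sketch: bind each replica on its own overlay-network IP instead of 0.0.0.0.
# "/start-cadence.sh" is a placeholder for the image's real entrypoint script.
# $$ escapes $ so Compose/Swarm does not interpolate it at deploy time.
soap_cadence_matching:
  image: ubercadence/server:0.15.1
  entrypoint:
    - sh
    - -c
    - export BIND_ON_IP=$$(hostname -i | awk '{print $$1}') && exec /start-cadence.sh

Any mechanism that gives each replica its own routable, per-instance address would do; the key point is that no two ringpop members may advertise the same address.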

After you resolve this, you will see in the FE/History logs that they successfully connect to two matching hosts:

{"level":"info","ts":"2021-02-18T22:47:36.799Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-matching","addresses":"[HostA_IP:7935, HostB_IP:7935]","logging-call-at":"rpServiceResolver.go:246"},

See the example in the Cadence Helm chart for how we do that in K8s: https://github.com/banzaicloud/banzai-charts/blob/87cf2946434c22cb963fea47b662ea85974ecfc0/cadence/templates/server-configmap.yaml#L82
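
Roughly, the approach there is the standard K8s pattern of injecting the pod IP through the Downward API and using it as the bind address; illustrated below (not copied verbatim from the chart):

# Downward API: expose the pod's own IP to the container,
# then reference it for Cadence's bind address.
env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  - name: BIND_ON_IP
    value: "$(POD_IP)"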

Long Quanzheng