
I learned that one worker thread is used per core from this link: Google Cloud Dataflow Worker Threading

Say we have one worker with 4 CPU cores, and the machine type is n1-standard-4.

With 4 worker threads processing messages from my topic, we noticed that CPU usage is very low.

Is there a way to increase the number of worker threads through code or pipeline design?

For example, would using group by and windows help?

Thank you.

馬登成
  • Is one worker keeping up with the data on the topic? – Kenn Knowles Sep 05 '19 at 04:32
  • Yes. We set numberworkers = 1 when deploying our Dataflow job. – 馬登成 Sep 05 '19 at 06:15
  • Could you explain what you are doing inside the pipeline? From what I read, you are reading from a topic (I assume PubSub) which has 50M messages (is this per day? in total? ...). If the bottleneck is in reading, GBK or windows won't make a difference. – Iñigo Sep 06 '19 at 14:52
  • I am confused - was there a large backlog or not? – Kenn Knowles Sep 06 '19 at 20:44
  • I put 500 million messages in total into PubSub. Each message is 20K in size. Inside the pipeline, for each message we look up data in Datastore by primary key and update it by that primary key. The bottleneck is not reading. I changed the database from Datastore to Memorystore, and it changed nothing. I saw that only 4 threads were processing my data. The worker log level was WARN. There wasn't a large backlog. – 馬登成 Sep 06 '19 at 22:10

1 Answer


Since the single worker is keeping up with the topic, it seems that there is no more work to give to the worker to use its CPU.
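As a rough illustration of that reasoning, here is a back-of-the-envelope sketch. The function name and all of the numbers are hypothetical and are not measured from the job in the question. It only assumes that, when a worker keeps up with its input, the CPU it needs is roughly the arrival rate multiplied by the CPU time spent per message.

```python
def expected_cpu_utilization(messages_per_second, cpu_seconds_per_message, cores):
    """Estimate the fraction of total CPU capacity needed to keep up with the input.

    Hypothetical helper for illustration only; it is not part of any Dataflow API.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    # CPU-seconds of work arriving per wall-clock second.
    demand = messages_per_second * cpu_seconds_per_message
    # A worker cannot use more than all of its cores.
    return min(demand / cores, 1.0)


# Assumed example values: 200 messages/s arriving, 2 ms of CPU per message, 4 cores.
utilization = expected_cpu_utilization(
    messages_per_second=200,
    cpu_seconds_per_message=0.002,
    cores=4,
)
print(f"Estimated CPU utilization: {utilization:.0%}")
```

Under these assumed values the worker needs only a small share of its 4 cores, so low CPU usage by itself would not show that more threads are required. If the real arrival rate or per-message CPU cost is much higher, the estimate changes accordingly.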

Kenn Knowles
  • Thank you for your message. I put 500 million messages into the topic, so I think there is enough work for the worker. I also took a thread dump and found that only 4 threads were processing my tasks, so I want to find a way to increase the worker's thread count. – 馬登成 Sep 06 '19 at 06:08