Is there a way to increase the number of Google Cloud Dataflow worker threads?

Question

I learned from the linked question, Google Cloud Dataflow Worker Threading, that one worker thread is used per core.

Say we have one worker with 4 CPU cores, with machine type n1-standard-4.

With 4 worker threads processing messages from my topic, we noticed that CPU usage is very low.

Is there a way to increase the number of worker threads, either in code or through the pipeline's design?

For example, would using a group-by and windows help?

Thank you.

Yes. We set the number of workers to 1 when deploying our Dataflow job. – 馬登成, Sep 05 '19 at 06:15
Could you explain what you are doing inside the pipeline? From what I read, you are reading from a topic (I assume Pub/Sub) which has 50M messages (is this per day? in total? ...). If the bottleneck is in reading, GBK or windows won't make a difference. – Iñigo, Sep 06 '19 at 14:52
I put 500 million messages in total into Pub/Sub. Each message is 20K in size. Inside the pipeline, for each message we look up data in Datastore by primary key and then update it by that primary key. The bottleneck is not reading. I changed the database from Datastore to Memorystore, and it changed nothing. I saw that only 4 threads were processing my data. The worker log level was WARN. There wasn't a large backlog. – 馬登成, Sep 06 '19 at 22:10
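One possible reading of the numbers in the comments above, offered as a sketch and not as a diagnosis: if each message triggers a blocking lookup by key, a thread spends most of its time waiting on the network, so 4 threads can be fully occupied while the CPU stays mostly idle. The plain-Python example below simulates that with `time.sleep`. It does not use any Dataflow or Datastore API; `fake_lookup`, `process_with_threads`, and the 20 ms latency are hypothetical stand-ins. It only illustrates why I/O-bound work shows low CPU usage at a fixed thread count, and says nothing about whether or how the thread count can be changed on a Dataflow worker.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Assumed round-trip time of one lookup by key (hypothetical value).
LOOKUP_LATENCY_S = 0.02


def fake_lookup(key):
    """Stand-in for a blocking read by primary key (not a real client call)."""
    time.sleep(LOOKUP_LATENCY_S)  # the thread waits here and uses almost no CPU
    return key, {"id": key}


def process_with_threads(keys, num_threads):
    """Run one lookup per key on a fixed-size thread pool.

    Returns (results, wall_seconds, cpu_seconds) so that elapsed time can be
    compared with the CPU time the process actually consumed.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(fake_lookup, keys))  # keeps input order
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return results, wall, cpu


if __name__ == "__main__":
    keys = list(range(40))
    for n in (4, 32):
        _, wall, cpu = process_with_threads(keys, n)
        print(f"{n:>2} threads: wall={wall:.2f}s cpu={cpu:.2f}s")
```

In this simulation the CPU time stays far below the wall-clock time at both pool sizes, and the larger pool finishes sooner only because more waits overlap. If the real pipeline behaves the same way, the per-message lookup latency, and not the CPU, would be what limits throughput.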

score 0 · Answer 1 · answered Sep 05 '19 at 20:18

Since the single worker is keeping up with the topic, it seems that there is no more work to give to the worker to use its CPU.

Kenn Knowles

Thank you for your message. I put 500 million messages into the topic, so I think there is enough work for the worker. I also took a thread dump and found that only 4 threads were processing my tasks, so I want to find a way to increase the number of worker threads. – 馬登成 Sep 06 '19 at 06:08