How can I specify the number of workers for my Dataflow?

Question

I have an Apache Beam pipeline that loads a large import file of around 90GB. I've written the pipeline in the Apache Beam Java SDK.

Using the default settings for PipelineOptionsFactory, my job takes quite a while to complete.

How can I control and programmatically specify the parallelism of my job, and thus the number of workers?

This is a question specifically about tuning the Dataflow programming environment. Seems totally on topic to me. — Frances, Jan 20 '15 at 04:49
Use --numWorkers to set a specific number of workers. If you want to allow the system to tune the number of workers up to a cap, use --autoscalingAlgorithm=BASIC and --maxWorkers=20. — Frances, Jan 20 '15 at 04:50
Programmatically, the setNumWorkers(int num) method defined in the DataflowPipelineWorkerPoolOptions interface should do the trick. More [here](https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/options/DataflowPipelineWorkerPoolOptions). — rf-, Jan 20 '15 at 05:27
Although you can pass the --numWorkers option to the pipeline, I would suggest configuring your pipeline to autoscale as required using --autoscalingAlgorithm=THROUGHPUT_BASED. You can refer to https://cloud.google.com/dataflow/service/dataflow-service-desc — Programmer, Aug 31 '16 at 21:06
"Questions on professional server- or networking-related infrastructure administration are off-topic for..." - This is related to the Dataflow programming API, and does not involve admin/config of a server via a CLI or the web dashboard. It is purely about the semantics of the API. How is this off-topic? Was this closed just on the basis that it "felt" like it was about system administration? — talonx, Jun 19 '18 at 12:11
This question has been voted 10 times, saved 4, viewed almost 1000. Also, I've rephrased it to focus on the framework's aspects rather than the infrastructure itself. I can improve the phrasing again if necessary. I'd like to request this question be opened. — Pablo, Sep 30 '19 at 18:18
Have a look here https://stackoverflow.com/questions/51848958/is-there-a-way-to-specify-a-minimum-number-of-workers-for-cloud-dataflow-w-auto/52149991 — MonicaPC, Oct 14 '19 at 22:10
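
The flags mentioned in the comments can also be assembled in code and handed to PipelineOptionsFactory. Below is a minimal sketch using only the JDK. The class and method names (WorkerFlags, fixedWorkers, autoscaled) are hypothetical, and the worker counts are placeholders. It assumes the Beam Java option names numWorkers, maxNumWorkers and autoscalingAlgorithm from DataflowPipelineWorkerPoolOptions; note that the cap is spelled maxNumWorkers there, not maxWorkers. Because numWorkers on its own is only the initial pool size when autoscaling is active, the fixed-size variant also sets the algorithm to NONE.

```java
/** Hypothetical helper that builds Dataflow worker-pool flags as pipeline arguments. */
final class WorkerFlags {

    /** Fixed pool: start with numWorkers workers and switch autoscaling off. */
    public static String[] fixedWorkers(int numWorkers) {
        if (numWorkers < 1) {
            throw new IllegalArgumentException("numWorkers must be at least 1");
        }
        return new String[] {
            "--numWorkers=" + numWorkers,
            "--autoscalingAlgorithm=NONE"
        };
    }

    /** Autoscaled pool: let the service pick the size, capped at maxNumWorkers. */
    public static String[] autoscaled(int maxNumWorkers) {
        if (maxNumWorkers < 1) {
            throw new IllegalArgumentException("maxNumWorkers must be at least 1");
        }
        return new String[] {
            "--autoscalingAlgorithm=THROUGHPUT_BASED",
            "--maxNumWorkers=" + maxNumWorkers
        };
    }

    public static void main(String[] args) {
        // In a real pipeline (assumption: Beam and the Dataflow runner are on the
        // classpath) the array would be passed on, for example:
        //   PipelineOptionsFactory.fromArgs(flags).withValidation()
        //       .as(DataflowPipelineOptions.class);
        System.out.println(String.join(" ", fixedWorkers(10)));
        System.out.println(String.join(" ", autoscaled(20)));
    }
}
```

With the options object in hand, the same settings should also be reachable through the setters the interface declares, such as the setNumWorkers(int) call mentioned above.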
