
I have an Apache Beam pipeline that loads a large import file of around 90GB. I've written the pipeline in the Apache Beam Java SDK.

Using the default settings for PipelineOptionsFactory, my job takes quite a while to complete.

How can I control and programmatically specify the parallelism for my job, and thus the number of workers?

Pablo
Alex Harvey
  • 4
    This is a question specifically about tuning the Dataflow programming environment. Seems totally on topic to me. – Frances Jan 20 '15 at 04:49
  • 6
    Use --numWorkers to set a specific number of workers. If you want to allow the system to tune the number of workers up to a cap, use --autoscalingAlgorithm=BASIC and --maxWorkers=20. – Frances Jan 20 '15 at 04:50
  • 4
    Programmatically, the setNumWorkers(int num) method defined in the DataflowPipelineWorkerPoolOptions Interface should do the trick. More [here](https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/options/DataflowPipelineWorkerPoolOptions). – rf- Jan 20 '15 at 05:27
  • 1
    Although you can pass the --numWorkers option to the pipeline, I would suggest configuring your pipeline to autoscale as and when required using --autoscalingAlgorithm=THROUGHPUT_BASED. You can refer to https://cloud.google.com/dataflow/service/dataflow-service-desc – Programmer Aug 31 '16 at 21:06
  • 1
    "Questions on professional server- or networking-related infrastructure administration are off-topic for..." - This is related to the Dataflow programming API, and does not involve admin/config of a server via a CLI or the web dashboard. It is purely about the semantics of the API. How is this off-topic? Was this closed just on the basis that it "felt" like it was about system administration? – talonx Jun 19 '18 at 12:11
  • 1
    This question has been voted 10 times, saved 4, viewed almost 1000. Also, I've rephrased it to focus on the framework's aspects rather than the infrastructure itself. I can improve the phrasing again if necessary. I'd like to request this question be opened. – Pablo Sep 30 '19 at 18:18
  • Have a look here https://stackoverflow.com/questions/51848958/is-there-a-way-to-specify-a-minimum-number-of-workers-for-cloud-dataflow-w-auto/52149991 – MonicaPC Oct 14 '19 at 22:10
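
The flags mentioned in the comments above could be assembled roughly as follows. This is a minimal sketch, not a confirmed answer: `WorkerFlags` is a hypothetical helper (it is not part of Beam), and it only builds the argument array you would hand to `PipelineOptionsFactory.fromArgs(...)`. The spelling `--maxNumWorkers` is an assumption based on the Beam Dataflow runner's `DataflowPipelineWorkerPoolOptions`; one comment above writes it as `--maxWorkers`, so check it against your SDK version.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Beam) that assembles worker-related flags. */
final class WorkerFlags {

    /** Flags for a fixed-size worker pool, following the --numWorkers suggestion. */
    static String[] fixed(int numWorkers) {
        if (numWorkers < 1) {
            throw new IllegalArgumentException("numWorkers must be >= 1");
        }
        return new String[] {"--numWorkers=" + numWorkers};
    }

    /** Flags that let the service scale from an initial size up to a cap. */
    static String[] autoscaled(int initialWorkers, int maxWorkers) {
        if (initialWorkers < 1 || maxWorkers < initialWorkers) {
            throw new IllegalArgumentException("need 1 <= initialWorkers <= maxWorkers");
        }
        List<String> flags = new ArrayList<>();
        flags.add("--numWorkers=" + initialWorkers);
        flags.add("--autoscalingAlgorithm=THROUGHPUT_BASED");
        // Assumed flag spelling; see the note above the code block.
        flags.add("--maxNumWorkers=" + maxWorkers);
        return flags.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Print the flags; with Beam on the classpath they would be passed to
        // PipelineOptionsFactory.fromArgs(flags).as(DataflowPipelineOptions.class).
        System.out.println(String.join(" ", autoscaled(5, 20)));
    }
}
```

To set the values in code instead of through flags, the `setNumWorkers(int)` method mentioned in the comments is called on the options object before the pipeline is created. That part needs the Beam Dataflow runner dependency, so it is described here but not shown.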

0 Answers