
I am using Dataflow for a MySQL to BigQuery data pipeline, based on the JDBC MySQL to BigQuery Dataflow template.

While creating a job from the Dataflow GUI, I can explicitly set the maximum number of workers and the initial number of workers.

But the problem is: if I specify two workers of n1-standard-4 size, 2 workers are created for some time and then one worker is automatically deleted. Why are both workers not running for the complete operation?

Also, there is no difference in elapsed time whether I use 1 or 2 workers. As per my understanding, the time should be halved if I use 2 workers instead of one. The number of files created in the GCS bucket's temp folder is also the same.

How does Dataflow manage its workers? How does it perform parallel processing? How should I decide the number and type of workers needed for my job?

Joseph N

2 Answers


The Beam framework implements something similar to MapReduce. You can parallelize the Map (ParDo -> Parallel Do), but you can't parallelize the Reduce (GroupBy); at least, not every GroupBy can be parallelized.

So, depending on your pipeline, Beam is able to dispatch the messages efficiently to be processed on each worker in parallel and then to wait before performing the GroupBy. The scalability works great for a complex pipeline, especially if you have several inputs and/or several outputs.
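To make this concrete, here is a minimal sketch in the Beam Python SDK; the in-memory source and step names are placeholders for illustration, not the JDBC template's actual code:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     # Stand-in source (the real template reads via JDBC).
     | 'Read' >> beam.Create([('a', 1), ('b', 2), ('a', 3)])
     # Map/ParDo step: each element is independent, so Dataflow can fan
     # this out across many workers.
     | 'Scale' >> beam.Map(lambda kv: (kv[0], kv[1] * 10))
     # GroupByKey: all values for a key must meet in one place, so this
     # is the synchronisation point that limits parallelism.
     | 'Group' >> beam.GroupByKey()
     | 'Print' >> beam.Map(print))
```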

In your case, your pipeline is very simple: there is no transformation (the part that you could do in parallel), simply a Read and a Write. What is there to parallelize? You don't need several workers for this!

A last point: the sink that you use, here BigQuery, can behave differently depending on your pipeline's running mode:

  • If you run your pipeline in batch mode (your case), BigQueryIO simply takes the data and creates files in a Cloud Storage staging bucket. Then, at the end, it triggers a single load job for all of the files into the correct table.
  • If you run your pipeline in streaming mode, BigQueryIO performs a stream write into BigQuery.

This mode can influence the parallelization capacity and the possible number of workers.
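As a rough sketch of those two behaviours with the Beam Python SDK (the table, schema, and bucket names below are placeholders):

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery

def write_to_bq(rows, streaming=False):
    # FILE_LOADS: stage files in GCS, then trigger one load job (batch).
    # STREAMING_INSERTS: push rows through the streaming API.
    method = (WriteToBigQuery.Method.STREAMING_INSERTS if streaming
              else WriteToBigQuery.Method.FILE_LOADS)
    return rows | 'Write' >> WriteToBigQuery(
        table='my-project:my_dataset.my_table',         # placeholder
        schema='name:STRING,value:INTEGER',             # placeholder
        custom_gcs_temp_location='gs://my-bucket/tmp',  # staging bucket
        method=method)
```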

guillaume blaquiere
  • Thanks guillaume blaquiere. My concern is: what if I have hundreds of GB of data? Will Dataflow load that data into BigQuery using only one worker? That would be very time-consuming. If I perform the same operation in Apache Sqoop, it distributes the data across several parallel processes and the speed is higher. Can't I do the same thing with BigQuery? If I have 100 GB of data in the source database and 10 workers running, it will take less time to process that data. This is what I am trying to achieve. – Joseph N Nov 19 '20 at 13:16
  • If you perform no processing (simply read MySQL and write to BigQuery), 1 worker with 4 CPUs should be enough. Only the network bandwidth will limit you (from the source or from [the worker](https://cloud.google.com/compute/docs/machine-types#n1_standard_machine_types)). Can you really provide more than 10Gbps of data from the source? At that bandwidth, 100GB takes about 80s to transfer. A Dataflow pipeline takes about 5 minutes to start and stop. Is the speed really a concern? – guillaume blaquiere Nov 19 '20 at 13:59
  • Yeah guillaume blaquiere, but what's the purpose of specifying 2 or more workers then? I mean, there must be some reason behind using multiple workers. – Joseph N Nov 19 '20 at 14:31
  • Those are the default Dataflow params. For this template, it's useless. If you have 100GB to sync, you can try with a small VM (n1-standard-1) that is not multi-threadable and has low network bandwidth. Maybe then you will leverage 2 parallel workers. – guillaume blaquiere Nov 19 '20 at 15:09

There are a couple of plausible reasons why your Dataflow job does not keep the two workers until the end:

-1st: Either the full job or some task is not parallelisable. Dataflow will remove the second worker so that you do not incur additional costs while that worker is idle.

-2nd: The workers are using, on average, less than 75% of their CPUs over two minutes and the streaming pipeline backlog is lower than 10 seconds (1).

Please bear in mind that scaling down does not occur immediately, as Dataflow is, in this sense, conservative. Normally, Dataflow will spend more time trying to add workers than using them. It's for that reason that, when you expect a high workload with sharp peaks, it is advisable to set a high starting number of workers.
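For instance, here is a sketch of fixing a high starting worker count with the standard Dataflow worker options in the Beam Python SDK (the project, region, and bucket values are placeholders):

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder
    region='us-central1',                # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder
    num_workers=10,       # start high for a sharp workload peak
    max_num_workers=10,   # cap autoscaling at the same size
    machine_type='n1-standard-4')
```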

On the other hand, if only 1 of the two workers is being used, the total amount of time will be the same regardless of whether you set the number of workers to 1 or 2. To better understand this concept, let me give an example:

Imagine you have an algorithm that produces a sequence of pseudo-random numbers where each value's computation depends on the previous number. This is a task where it does not matter whether you have 1 or 100 workers; it will always run at the same speed. At the same time, for other use cases, for example where each number does not depend on the previous one, the task would be approximately 100 times faster with 100 workers.
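A toy Python illustration of the difference (the constants are just an arbitrary linear congruential generator, chosen to make the dependency explicit):

```python
def sequential_prng(n, seed=42):
    # Each value depends on the previous one, so the loop cannot be
    # split up: 1 or 100 workers would run it at the same speed.
    values, x = [], seed
    for _ in range(n):
        x = (1103515245 * x + 12345) % 2**31
        values.append(x)
    return values

def independent_values(n):
    # Each value is a pure function of its index, so the range could
    # be sharded across 100 workers with no coordination.
    return [(1103515245 * i + 12345) % 2**31 for i in range(n)]
```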

All in all, Dataflow considers the parallelisability of each task and, depending on the rules stated in (1), scales up and down. A higher number of workers may or may not be faster, but it will be more expensive.

Please take a look at (2) for better insight into parallelization and distribution in Dataflow. I've also found two Stack Overflow questions, (3) and (4), that might help shed some light on your question.

rodvictor
  • Thanks rodvictor. In my case, Dataflow is reading MySQL tables and writing to BigQuery. Can I distribute the data across multiple workers to decrease the time? – Joseph N Nov 19 '20 at 15:24
  • The issue with physically reading from and writing to disk is that it is not a parallelisable task, as explained in detail in (1). If you were to do some processing on your data before copying it to BigQuery, such as changing the format, counting words, etc., this central part of the pipeline could be parallelised much more efficiently. (1): https://stackoverflow.com/questions/34667551/parallel-file-writing-is-it-efficient – rodvictor Nov 19 '20 at 15:42