
I have a question about the DataFusion Data Pipeline.

I'm using the Enterprise edition of Data Fusion.

When I create a data pipeline in the Data Fusion Studio, I can set the CPU and memory values of the executor and driver directly in the config.

Until now, I thought that creating a data pipeline would create one VM instance per pipeline.

However, I just noticed that multiple VMs are created: worker nodes and master nodes.

So what do the executor and driver CPU and memory settings mean when creating a data pipeline?

Quack

1 Answer


For a Spark pipeline run, Data Fusion starts one driver and multiple executors, usually one per worker node (though not always). The driver and executor CPU and memory settings therefore set an upper bound on the CPUs and memory that the driver and each executor can use for the run.

In practice, this upper bound may not be reached if, for example, you set the memory or CPUs for an executor higher than what is available in the worker node.
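One way to think about that upper bound is as a minimum of what you request and what a worker node can provide. The sketch below is illustrative only; the function name and the capacity numbers are hypothetical, not part of any Data Fusion or Spark API.

```python
def effective_executor_resources(requested_cpus, requested_mem_gb,
                                 worker_cpus, worker_mem_gb):
    """Resources an executor can effectively use on a single worker node.

    The requested values act as an upper bound; the worker node's
    capacity caps anything requested beyond what it has available.
    """
    return (min(requested_cpus, worker_cpus),
            min(requested_mem_gb, worker_mem_gb))

# Requesting 8 CPUs / 32 GB on a 4-CPU / 16 GB worker: the worker caps it.
print(effective_executor_resources(8, 32, 4, 16))  # (4, 16)
```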

Dennis Li
  • Do I need to match the executor's CPU and memory with the worker node's CPU and memory? If not, how should they relate? – Quack Oct 28 '20 at 01:48
  • Should the executor's CPU and memory be greater than the total CPU and memory of the worker nodes? (If there are two worker nodes, the sum of both nodes' CPUs and memory?) Also, should the driver's CPU and memory be larger than the executor's CPU and memory? You said that the driver has multiple executors; can you explain how to set up the CPU and memory combination between these two? – Quack Oct 28 '20 at 10:31
  • The CPU should either be set at or below the worker's CPUs. The memory should be set below the worker's memory because some system services need memory to run. However, these settings will generally depend on what kinds of data pipelines you are trying to run. Pipelines handling larger amounts of data using aggregators and joiners should probably have higher CPU and memory counts to perform optimally. Since many of these abstractions map 1:1 with existing YARN abstractions, it might be helpful to look online for MapReduce or Spark resource tuning guides. – Dennis Li Oct 29 '20 at 02:32
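The sizing guidance above maps onto the standard Spark resource properties. The values below are only a sketch under the assumption of a small worker node; they are illustrative, not recommendations.

```properties
# Standard Spark resource properties (illustrative values only).
# Keep executor cores/memory at or below a single worker node's capacity,
# leaving memory headroom on the node for system services.
spark.driver.cores=1
spark.driver.memory=2g
spark.executor.cores=2
spark.executor.memory=4g
```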