I have 1000 Parquet files and I want one executor to work on each file during an intermediate stage. Is there a way to set this manually? By default, Spark ends up creating 34 tasks for the job, which ends up skewed.
- How are you submitting the job? Please share the code. – dassum Nov 04 '19 at 18:46
- @PythonBoi I suppose that Spark is using `spark.default.parallelism` in this case, which is equal to the sum of cores assigned to the job. Are you using the Spark Core (RDD) API or Spark SQL (DataFrame/Dataset)? What is the storage (S3/HDFS)? Take a look at this answer https://stackoverflow.com/questions/50825835/does-spark-maintain-parquet-partitioning-on-read/51877075#51877075 – VB_ Nov 04 '19 at 20:22
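If you want to check the values the comment above refers to, a short PySpark snippet like the following can help (the parquet path is a placeholder, not from the question):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default parallelism used for RDD shuffles; typically the total number
# of cores available to the application.
print(spark.sparkContext.defaultParallelism)

# Number of partitions Spark actually created when reading the files.
df = spark.read.parquet("/path/to/parquet/")  # placeholder path
print(df.rdd.getNumPartitions())
```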
1 Answer
You can call repartition on your input DataFrame/RDD and run your operations on the resulting DF/RDD.
changedDF = inputDF.repartition(500)
Use changedDF instead of inputDF to perform your operation(s); you should then get 500 tasks.
If needed, on a DataFrame you can also pass one or more columns to repartition by: changedDF = inputDF.repartition(inputDF.col1)

– Naga
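A minimal, end-to-end sketch of the approach in the answer, assuming the input and output paths are placeholders and that you want one partition (and therefore one task) per input file; the column used in the sample aggregation is also a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("one-task-per-file").getOrCreate()

# Placeholder input location (S3/HDFS/local).
inputDF = spark.read.parquet("/data/input/")

# Repartition so the next stage runs one task per partition;
# 1000 matches the number of input files in the question.
changedDF = inputDF.repartition(1000)

# Perform subsequent operations on changedDF instead of inputDF.
# "some_column" is a placeholder column name.
result = changedDF.groupBy("some_column").count()

result.write.parquet("/data/output/")  # placeholder output path
```

Note that repartition triggers a full shuffle, so the resulting 1000 partitions will be roughly equal in size rather than mapping one-to-one onto the original files; that is usually what you want when the default partitioning is skewed.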