I have 1000 Parquet files and I want one executor to work on each file during an intermediate stage. Is there a way to set this manually? By default, Spark ends up creating 34 tasks for the job, which ends up skewed.
- How are you submitting the job? Please share the code. – dassum Nov 04 '19 at 18:46
- @PythonBoi I suppose that Spark is using `spark.default.parallelism` in this case, which is equal to the sum of cores assigned to the job. Are you using the Spark Core (RDD) API or Spark SQL (DataFrame/Dataset)? What is the storage (S3/HDFS)? Take a look at this answer https://stackoverflow.com/questions/50825835/does-spark-maintain-parquet-partitioning-on-read/51877075#51877075 – VB_ Nov 04 '19 at 20:22
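If you want to check the values the comment above refers to, a short PySpark snippet like the following can help (the parquet path is a placeholder, not from the question):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default parallelism used for RDD shuffles; typically the total number
# of cores available to the application.
print(spark.sparkContext.defaultParallelism)

# Number of partitions Spark actually created when reading the files.
df = spark.read.parquet("/path/to/parquet/")  # placeholder path
print(df.rdd.getNumPartitions())
```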
1 Answer
You can call repartition on your input DataFrame/RDD and run your operations on the resulting DF/RDD.
changedDF = inputDF.repartition(500)
Use changedDF instead of inputDF to perform your operation(s); you should then get 500 tasks.
If needed, on a DataFrame you can also pass one or more columns to repartition by: changedDF = inputDF.repartition(inputDF.col1)

– Naga
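A minimal, end-to-end sketch of the approach in the answer, assuming the input and output paths are placeholders and that you want one partition (and therefore one task) per input file; the column used in the sample aggregation is also a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("one-task-per-file").getOrCreate()

# Placeholder input location (S3/HDFS/local).
inputDF = spark.read.parquet("/data/input/")

# Repartition so the next stage runs one task per partition;
# 1000 matches the number of input files in the question.
changedDF = inputDF.repartition(1000)

# Perform subsequent operations on changedDF instead of inputDF.
# "some_column" is a placeholder column name.
result = changedDF.groupBy("some_column").count()

result.write.parquet("/data/output/")  # placeholder output path
```

Note that repartition triggers a full shuffle, so the resulting 1000 partitions will be roughly equal in size rather than mapping one-to-one onto the original files; that is usually what you want when the default partitioning is skewed.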