I'm wondering how to force Spark to use subsequent, more appropriately partitioned DataFrames when importing source data with `spark-csv`.
Summary:

- `spark-csv` doesn't seem to support explicit partitioning on import the way `sc.textFile()` does.
- While it gives me an inferred schema "for free", by default the returned DataFrames have only 2 partitions, even though I'm running 8 executors in my cluster (a minimal sketch of the import is below).
- Even though subsequent DataFrames with many more partitions are cached via `cache()` and used for all further processing (immediately after import of the source files), the Spark job history still shows an incredible skew in the task distribution: 2 executors handle the vast majority of the tasks instead of the more even distribution I expect.
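For reference, this is roughly how the import looks; the path and the `sqlContext` setup are placeholders, and the options are just the usual ones for a headered file with schema inference:

```scala
// Spark 1.x with the Databricks spark-csv package; path is a placeholder.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs:///data/source.csv")

// Reports only 2 partitions for me, regardless of how many executors are available.
println(df.rdd.partitions.length)
```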
Can't post data, but the code is just some simple joining, adding a few columns via `.withColumn()`, and then very basic linear regression via `spark.mllib`.
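Roughly the shape of that processing (the column names, join key, and feature selection below are made up, since I can't post the real data):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.sql.functions.col

// Join the imported DataFrames and derive a couple of columns.
val joined = dfA.join(dfB, Seq("id"))
  .withColumn("ratio", col("x") / col("y"))

// Convert to labeled points for the RDD-based MLlib API (in Spark 1.x,
// DataFrame.map returns an RDD). Column positions are illustrative only.
val points = joined.map(row =>
  LabeledPoint(row.getDouble(0), Vectors.dense(row.getDouble(1), row.getDouble(2))))

val model = LinearRegressionWithSGD.train(points, 100)
```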
Below is a comparison image from the Spark History UI showing tasks per executor (the last row is the driver).
Note: I get the same skewed task distribution regardless of whether I call `repartition()` on the `spark-csv` DataFrames or not (see the sketch after this note).
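That attempt looks roughly like this; the partition count (16) is arbitrary:

```scala
// Repartition and cache immediately after the spark-csv import.
val repartitioned = df.repartition(16).cache()

// The count is now 16 as expected, but the per-executor task distribution
// in the History UI still looks just as skewed as before.
println(repartitioned.rdd.partitions.length)
```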
How do I "force" Spark to basically forget those initial DataFrames and start from more appropriately partitioned DataFrames, or force spark-csv to somehow partition its DataFrames differently (without forking it/modifying its source)?
I can resolve this issue using `sc.textFile(file, minPartitions)`, but I'm hoping I don't have to resort to that because of things like the nicely typed schema that `spark-csv` provides.
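For completeness, this is the fallback I'd rather avoid; the column names and types are placeholders standing in for the schema that `spark-csv` would otherwise infer for me:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Control the partition count at read time, at the cost of hand-rolling
// the parsing and the schema.
val raw = sc.textFile("hdfs:///data/source.csv", minPartitions = 16)
val header = raw.first()

// Naive CSV parsing with placeholder columns; real CSVs need proper quoting/escaping.
val rows = raw.filter(_ != header).map { line =>
  val fields = line.split(",")
  Row(fields(0), fields(1).toDouble)
}

val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("value", DoubleType, nullable = true)))

val manualDf = sqlContext.createDataFrame(rows, schema)
```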