I'm wondering how to force Spark to use subsequent, more appropriately partitioned DataFrames when importing source data with `spark-csv`.
Summary:

- `spark-csv` doesn't seem to support explicit partitioning on import the way `sc.textFile()` does.
- While it gives me an inferred schema "for free", by default the returned DataFrames have only 2 partitions, even though I'm running 8 executors in my cluster (a minimal sketch of the import is below).
- Even though subsequent DataFrames with many more partitions are cached via `cache()` and used for all further processing (immediately after import of the source files), the Spark job history still shows an incredible skew in the task distribution: 2 executors handle the vast majority of the tasks instead of the more even distribution I expect.
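For reference, this is roughly how the import looks; the path and the `sqlContext` setup are placeholders, and the options are just the usual ones for a headered file with schema inference:

```scala
// Spark 1.x with the Databricks spark-csv package; path is a placeholder.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs:///data/source.csv")

// Reports only 2 partitions for me, regardless of how many executors are available.
println(df.rdd.partitions.length)
```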
Can't post data, but the code is just some simple joining, adding a few columns via `.withColumn()`, and then very basic linear regression via `spark.mllib`.
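Roughly the shape of that processing (the column names, join key, and feature selection below are made up, since I can't post the real data):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.sql.functions.col

// Join the imported DataFrames and derive a couple of columns.
val joined = dfA.join(dfB, Seq("id"))
  .withColumn("ratio", col("x") / col("y"))

// Convert to labeled points for the RDD-based MLlib API (in Spark 1.x,
// DataFrame.map returns an RDD). Column positions are illustrative only.
val points = joined.map(row =>
  LabeledPoint(row.getDouble(0), Vectors.dense(row.getDouble(1), row.getDouble(2))))

val model = LinearRegressionWithSGD.train(points, 100)
```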
Below is a comparison image from the Spark History UI showing tasks per executor (the last row is the driver).
Note: I get the same skewed task distribution regardless of whether I call `repartition()` on the `spark-csv` DataFrames or not (see the sketch after this note).
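That attempt looks roughly like this; the partition count (16) is arbitrary:

```scala
// Repartition and cache immediately after the spark-csv import.
val repartitioned = df.repartition(16).cache()

// The count is now 16 as expected, but the per-executor task distribution
// in the History UI still looks just as skewed as before.
println(repartitioned.rdd.partitions.length)
```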
How do I "force" Spark to basically forget those initial DataFrames and start from more appropriately partitioned DataFrames, or force spark-csv to somehow partition its DataFrames differently (without forking it/modifying its source)?
I can resolve this issue using `sc.textFile(file, minPartitions)`, but I'm hoping I don't have to resort to that because of things like the nicely typed schema that `spark-csv` provides.
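For completeness, this is the fallback I'd rather avoid; the column names and types are placeholders standing in for the schema that `spark-csv` would otherwise infer for me:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Control the partition count at read time, at the cost of hand-rolling
// the parsing and the schema.
val raw = sc.textFile("hdfs:///data/source.csv", minPartitions = 16)
val header = raw.first()

// Naive CSV parsing with placeholder columns; real CSVs need proper quoting/escaping.
val rows = raw.filter(_ != header).map { line =>
  val fields = line.split(",")
  Row(fields(0), fields(1).toDouble)
}

val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("value", DoubleType, nullable = true)))

val manualDf = sqlContext.createDataFrame(rows, schema)
```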