
In Spark, I can do

sc.parallelize([(0, 0), (1, 1), (0, 2), (1, 3), (0, 4), (1, 5)], 2).partitionBy(2)

However, this first distributes the data across the nodes of the cluster, only to then shuffle it. Is there a way to partition by key immediately, as the data is distributed from the driver program?

It is possible to avoid data movement by organizing local data first, but it looks like an artificial issue. You should never use `parallelize` to pass data that is large enough for a subsequent shuffle to be an issue. – zero323 May 30 '16 at 10:00
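
A minimal sketch of what "organizing local data first" could look like, assuming the usual `sc` from a PySpark shell; parallelizing a list of pre-built buckets and unpacking them with `flatMap` is one possible technique, not something the comment prescribes:

    num_parts = 2
    data = [(0, 0), (1, 1), (0, 2), (1, 3), (0, 4), (1, 5)]

    # Bucket the records on the driver by the target partition of each key.
    buckets = [[] for _ in range(num_parts)]
    for k, v in data:
        buckets[hash(k) % num_parts].append((k, v))

    # Parallelize the list of buckets with one slice per bucket, then unpack
    # each bucket inside its own partition.
    rdd = sc.parallelize(buckets, num_parts).flatMap(lambda bucket: bucket)

    print(rdd.glom().collect())
    # [[(0, 0), (0, 2), (0, 4)], [(1, 1), (1, 3), (1, 5)]]

    # Caveat: rdd.partitioner is still None, so a subsequent partitionBy(2)
    # would still schedule a shuffle, even though little data actually moves.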

1 Answer


In the example you provided, Spark is not aware of how the data is partitioned until you explicitly specify it via partitionBy().
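
To make that concrete (a sketch, assuming the usual `sc` from a PySpark shell): before `partitionBy()` the RDD's `partitioner` attribute is `None`; afterwards the RDD carries a partitioner, which later key-based operations such as `reduceByKey` can reuse instead of shuffling a second time:

    pairs = sc.parallelize([(0, 0), (1, 1), (0, 2), (1, 3), (0, 4), (1, 5)], 2)
    print(pairs.partitioner)        # None: the layout is unknown to Spark

    partitioned = pairs.partitionBy(2)
    print(partitioned.partitioner)  # a Partitioner object: the layout is known

    # Later key-based operations can reuse the known partitioning instead of
    # shuffling a second time.
    totals = partitioned.reduceByKey(lambda a, b: a + b)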

But Spark can leverage the natural partitioning of the data if it is already organized in an appropriate way. For instance, when Spark reads data from a file system, partitions are derived from the input splits, so data that is already laid out appropriately on disk starts out partitioned that way.

So the nature of the data, the file system, etc. will influence partitioning in Spark.
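
For example (again a sketch assuming `sc`; local files stand in for HDFS here), the number of input files directly shapes the initial partitioning when reading text data:

    import os, tempfile

    # Three small files: each one becomes its own input split, hence its own
    # partition.
    path = tempfile.mkdtemp()
    for i in range(3):
        with open(os.path.join(path, "part-%d.txt" % i), "w") as f:
            f.write("line-%d\n" % i)

    print(sc.textFile(path).getNumPartitions())  # 3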
