
In Spark, I can do

sc.parallelize([(0, 0), (1, 1), (0, 2), (1, 3), (0, 4), (1, 5)], 2).partitionBy(2)

However, this first distributes the data across the nodes of the cluster, only to then shuffle it. Is there a way to partition by key immediately, as the data is distributed from the driver program?

It is possible to avoid data movement by organizing local data first, but it looks like an artificial issue. You should never use `parallelize` to pass data that is large enough for a subsequent shuffle to be an issue. – zero323 May 30 '16 at 10:00
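
A minimal sketch of what "organizing local data first" could look like, assuming the usual `sc` from a PySpark shell; parallelizing a list of pre-built buckets and unpacking them with `flatMap` is one possible technique, not something the comment prescribes:

    num_parts = 2
    data = [(0, 0), (1, 1), (0, 2), (1, 3), (0, 4), (1, 5)]

    # Bucket the records on the driver by the target partition of each key.
    buckets = [[] for _ in range(num_parts)]
    for k, v in data:
        buckets[hash(k) % num_parts].append((k, v))

    # Parallelize the list of buckets with one slice per bucket, then unpack
    # each bucket inside its own partition.
    rdd = sc.parallelize(buckets, num_parts).flatMap(lambda bucket: bucket)

    print(rdd.glom().collect())
    # [[(0, 0), (0, 2), (0, 4)], [(1, 1), (1, 3), (1, 5)]]

    # Caveat: rdd.partitioner is still None, so a subsequent partitionBy(2)
    # would still schedule a shuffle, even though little data actually moves.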

1 Answer


In the example you provided, Spark is not aware of how the data is partitioned until you explicitly specify it via partitionBy().
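
To make that concrete (a sketch, assuming the usual `sc` from a PySpark shell): before `partitionBy()` the RDD's `partitioner` attribute is `None`; afterwards the RDD carries a partitioner, which later key-based operations such as `reduceByKey` can reuse instead of shuffling a second time:

    pairs = sc.parallelize([(0, 0), (1, 1), (0, 2), (1, 3), (0, 4), (1, 5)], 2)
    print(pairs.partitioner)        # None: the layout is unknown to Spark

    partitioned = pairs.partitionBy(2)
    print(partitioned.partitioner)  # a Partitioner object: the layout is known

    # Later key-based operations can reuse the known partitioning instead of
    # shuffling a second time.
    totals = partitioned.reduceByKey(lambda a, b: a + b)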

But Spark can leverage the natural partitioning of the data if it is already organized in an appropriate way. For instance, when Spark reads data from a file system, partitions are derived from the input splits, so data that is already laid out appropriately on disk starts out partitioned that way.

So the nature of the data, the file system, etc. will influence partitioning in Spark.
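
For example (again a sketch assuming `sc`; local files stand in for HDFS here), the number of input files directly shapes the initial partitioning when reading text data:

    import os, tempfile

    # Three small files: each one becomes its own input split, hence its own
    # partition.
    path = tempfile.mkdtemp()
    for i in range(3):
        with open(os.path.join(path, "part-%d.txt" % i), "w") as f:
            f.write("line-%d\n" % i)

    print(sc.textFile(path).getNumPartitions())  # 3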
