In Spark, I can do
sc.parallelize([(0, 0), (1, 1), (0, 2), (1, 3), (0, 4), (1, 5)], 2).partitionBy(2)
However, this first distributes the data across the nodes of the cluster only to shuffle it again afterwards. Is there a way to partition by key immediately, as the data is distributed from the driver program?
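To make the double data movement concrete, here is a minimal sketch (assuming a local SparkContext named sc); glom() shows how the elements land after the initial parallelize split, and again after the shuffle triggered by partitionBy:

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "partition-demo")

    data = [(0, 0), (1, 1), (0, 2), (1, 3), (0, 4), (1, 5)]

    # Step 1: parallelize splits the list into contiguous slices,
    # ignoring the keys entirely.
    rdd = sc.parallelize(data, 2)
    print(rdd.glom().collect())
    # e.g. [[(0, 0), (1, 1), (0, 2)], [(1, 3), (0, 4), (1, 5)]]

    # Step 2: partitionBy then shuffles everything so that equal
    # keys end up in the same partition.
    partitioned = rdd.partitionBy(2)
    print(partitioned.glom().collect())
    # e.g. [[(0, 0), (0, 2), (0, 4)], [(1, 1), (1, 3), (1, 5)]]

So the keys are only grouped in the second step, after the data has already been sent out once.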