
Here is my dataframe:

[screenshot of the dataframe]

The underlying RDD has 2 partitions

[screenshots of the RDD partitions]

When I do a df.count, the DAG produced is:

[DAG screenshot for df.count]

When I do a df.rdd.count, the DAG produced is:

[DAG screenshot for df.rdd.count]

Question: count is an action in Spark; the official definition is "Returns the number of rows in the DataFrame." Why does a shuffle occur when I perform the count on the dataframe, while no shuffle occurs when I do the same on the underlying RDD?

It makes no sense to me why a shuffle would occur at all. I tried to go through the source code of count on the Spark GitHub, but it doesn't fully make sense to me. Is the groupBy being supplied to the action the culprit?

P.S. df.coalesce(1).count does not cause any shuffle.

human

3 Answers


It seems that DataFrame's count operation uses groupBy, which results in a shuffle. Below is the code from https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:

/**
 * Returns the number of rows in the Dataset.
 * @group action
 * @since 1.6.0
 */
def count(): Long = withAction("count", groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0)
}
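
To see that count really builds this aggregation, you can run the groupBy().count() form yourself and compare it with df.count(). This is a minimal sketch; the SparkSession name spark and the example dataframe are assumptions for illustration:

// Illustrative dataframe with 2 partitions; any existing DataFrame would do.
val df = spark.range(0, 100).toDF("id").repartition(2)

// What Dataset.count() does internally: a key-less groupBy().count(),
// then collect the single resulting row and read its Long value.
val viaGroupBy = df.groupBy().count().collect().head.getLong(0)

// Same number, produced through the same aggregation (and the same shuffle).
val viaCount = df.count()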

RDD's count, on the other hand, runs a job that computes the size of each partition's iterator; the job returns the per-partition counts as an Array, and .sum then adds up the elements of that array on the driver.

Code snippet from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala:

/**
* Return the number of elements in the RDD.
*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
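
A rough public-API equivalent of what that line does (a sketch, not the actual implementation; sc is assumed to be an existing SparkContext): count each partition where it lives, ship only the per-partition counts back to the driver, and sum them there. No shuffle is needed because nothing is regrouped by key.

// Illustrative RDD with 2 partitions.
val rdd = sc.parallelize(1 to 100, numSlices = 2)

// One count per partition, computed where the data lives...
val perPartition: Array[Long] =
  rdd.mapPartitions(it => Iterator(it.size.toLong)).collect()

// ...then summed on the driver. This mirrors
// sc.runJob(this, Utils.getIteratorSize _).sum
val total = perPartition.sum
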
Pratyush Sharma
  • Thanks Pratyush. A few questions: 1. How exactly does groupBy().count().queryExecution work, given that groupBy and count are both methods? 2. What does the underscore in sc.runJob(this, Utils.getIteratorSize _) mean? – human Nov 10 '17 at 01:47

When Spark performs a dataframe operation, it first computes partial counts for every partition and then has another stage to sum those up. This is particularly good for large dataframes, where distributing the counting over multiple executors actually improves performance.

The place to verify this is the SQL tab of the Spark UI, which will show a physical plan description along these lines:

*HashAggregate(keys=[], functions=[count(1)], output=[count#202L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#206L])
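
If you prefer not to open the UI, the same plan can be printed from code (a small sketch; df stands for whatever dataframe you are counting):

// explain() prints the physical plan; the two HashAggregate nodes are the
// partial per-partition counts and the final sum after the Exchange.
df.groupBy().count().explain()
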
nefo_x
  • That makes some sense. What happens in the case of rdd.count then? Assume rdd has 2 partitions. – human Nov 09 '17 at 11:51
  • Most likely those RDD partitions were on the same executor at the moment of the operation. I'm not aware of deeper details on RDDs, though I have looked heavily into the mechanics of DataFrames. One practical rule with RDDs is to use them only when you need to transform dataframes or create dataframes from less structured sources, since dataframes are generally faster at working with structured data. – nefo_x Nov 09 '17 at 12:22
  • I ran the test on a 3-partition df and confirm your result: the individual partition counts are calculated in one stage, then a shuffle is incurred that writes 3 partitions, and a following stage reads these 3 partitions and sums them up. However, this still doesn't quite make sense: why should the shuffle occur, since it looks like a narrow-dependency transformation + action? – human Nov 09 '17 at 12:35
  • 4
    I am yet to somehow verify it (perhaps visit the code) however I believe the way rdd.count works is where individual partition counts are calculated and sent to driver to do a final sum - all this happens in one stage. – human Nov 09 '17 at 12:41

In the shuffle stage, the key is empty and the value is the partial count of each partition; all of these (key, value) pairs are shuffled to one single partition.

That is, the amount of data moved in the shuffle stage is very small: one partial count per input partition.
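
An RDD-level sketch of what such a shuffle would carry (illustrative only, not Spark's actual code; sc is assumed to be an existing SparkContext): each partition emits a single (empty key, partial count) pair, and those few tiny records are all that move across the network.

// Illustrative RDD with 4 partitions and a million rows.
val rdd = sc.parallelize(1 to 1000000, numSlices = 4)

// Each partition emits exactly one record: an empty key and its local count.
val partials = rdd.mapPartitions(it => Iterator(((), it.size.toLong)))

// The shuffle moves just those 4 tiny records into a single partition,
// where they are summed into the final count.
val total = partials.reduceByKey(_ + _, numPartitions = 1).values.first()

With 4 input partitions, only 4 small records are shuffled, no matter how many rows the RDD holds.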

Tom