I'm joining some DataFrames together in Spark and I keep getting the following error:

PartitioningCollection requires all of its partitionings have the same numPartitions.

It seems to happen after I join two DataFrames that each look fairly reasonable on their own: if I try to fetch a row from the joined DataFrame, I get this error. I am really just trying to understand why this error appears, or what it means, as I can't find any documentation on it.

The following invocation results in this exception:

val resultDataframe = dataFrame1
  .join(dataFrame2, $"first_column" === $"second_column")
  .take(2)

but I can certainly call

dataFrame1.take(2)

and

dataFrame2.take(2)

I also tried repartitioning the DataFrames, using Dataset.repartition(numPartitions) or Dataset.coalesce(numPartitions) on dataFrame1 and dataFrame2 before joining, and on resultDataFrame after the join, but nothing affected the error. Some cursory googling hasn't turned up anyone else hitting this error...
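
For reference, the repartitioning attempts looked roughly like this (the partition count of 200 is just an illustrative value):

// None of these attempts changed the error; 200 is an arbitrary count.
val df1 = dataFrame1.repartition(200)    // also tried .coalesce(200)
val df2 = dataFrame2.repartition(200)
val joined = df1.join(df2, $"first_column" === $"second_column")
joined.repartition(200).take(2)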

Clemente Cuevas

4 Answers

I have encountered the same issue in the last few days, and I was disappointed when I found no references on the internet. Until yours!

A couple of things I would add: I get the error after a pretty complicated set of operations on DataFrames (multiple joins), and these operations involve DataFrames generated from the same parent DataFrame. I'm trying to produce a minimal example that replicates it, but it's not trivial to extract one from my pipeline.

I suspect Spark might be having trouble computing a correct plan when the DAG gets too complicated. Unfortunately, if it is a bug in Spark 2.0.0, the nightly builds have not fixed it yet (I tried a 2.0.2 snapshot a couple of days ago).

A practical workaround that fixes the issue (temporarily) seems to be: at some point in your pipeline, write some of your DataFrames to disk and read them back. This forces Spark to optimize a much smaller, more manageable plan, and it no longer crashes. Of course, it's just a temporary fix.
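
A rough sketch of that workaround (the path, the DataFrame names, and the choice of Parquet are placeholders; spark is the SparkSession):

// Write an intermediate DataFrame to disk and read it back. Re-reading
// truncates the lineage, so Spark plans the rest of the pipeline from a
// fresh, simple scan instead of the accumulated join DAG.
intermediateDf.write.mode("overwrite").parquet("/tmp/intermediate_checkpoint")
val reloadedDf = spark.read.parquet("/tmp/intermediate_checkpoint")

// Continue the pipeline from the reloaded DataFrame.
val result = reloadedDf.join(otherDf, $"first_column" === $"second_column")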

leotac
  • Thank you for your demonstration of solidarity and what may hopefully be a useful, albeit admittedly temporary, solution. I'll try this out, but I think there is some possibility that we have a bug report on our hands if the solution stays out of Stack Overflow's grasp for much longer. – Clemente Cuevas Sep 30 '16 at 16:18
  • Note also that on version 1.6.x the same code (barring very minor differences) works as intended without crashing, so it does sound like a bug to me, too. – leotac Sep 30 '16 at 16:30
  • Your temporary solution did solve the problem, though! I hesitate to mark it as the answer just yet; if no one else responds and we decide to head to the Spark JIRA, I might as well, but thanks. – Clemente Cuevas Sep 30 '16 at 16:59

I've also had the same problem. For me it occurred after removing some columns from the select part of a join (not the join clause itself).

I was able to fix it by calling .repartition() on the dataframe.
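
A minimal sketch of that fix (the DataFrame name and the partition count are placeholders):

// Repartitioning the joined DataFrame forces a fresh shuffle, which gives
// the result a single, consistent partitioning; 200 is an arbitrary count.
val repartitioned = joinedDf.repartition(200)
repartitioned.take(2)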

Nick Lothian

Do you call the cache method?

This problem happens to me only when I use the cache method. If I don't call it, I can use the data without any problem.
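
For comparison, a sketch of the two patterns (the DataFrame names are placeholders, and the failing line reflects my experience rather than guaranteed behavior):

// With cache(), the subsequent action failed for me with the
// PartitioningCollection error; the same query without cache() ran fine.
val cached = dataFrame1.join(dataFrame2, $"first_column" === $"second_column").cache()
// cached.take(2)  // failed in my case

val uncached = dataFrame1.join(dataFrame2, $"first_column" === $"second_column")
uncached.take(2)   // worked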

Luis A.G.

This problem is related to the ReorderJoinPredicates rule and was fixed in Spark 2.3.0.

seaman29