
I'm new to PySpark and haven't worked with Spark in general for a few years. Can someone explain what happens here:

import random
import pyspark

sc.stop()
sc = pyspark.SparkContext('local[*]')


xx = sc.parallelize(range(100))
yy = sc.parallelize(list(map(lambda x : x *2, range(100))))
print(xx.count())
print(yy.count())
zipped = xx.zip(yy)
print(zipped.collect())

Output:

ValueError                             Traceback (most recent call last)
<ipython-input-57-a532cb7061c7> in <module>
     11 print(yy.count())
     12 zipped = xx.zip(yy)
---> 13 print(zipped.collect())
...  
...
ValueError: Can not deserialize PairRDD with different number of items in batches: (9, 8)
yuranos
  • Pretty sure the answer is correct. – thebluephantom Nov 23 '20 at 10:23
  • Did you disprove me? – thebluephantom Nov 24 '20 at 19:45
  • No, sorry, @thebluephantom, I work on that course on weekends. I will check it closer to the end of the week and hopefully confirm that all is correct. – yuranos Nov 24 '20 at 21:15
  • Look at the example in the link. Success. – thebluephantom Nov 24 '20 at 21:22
  • If I just change the xx and yy definitions to xx = sc.parallelize(range(100)).repartition(5) and yy = sc.parallelize(list(map(lambda x : x *2, range(100)))).repartition(5), it doesn't help; I still get: Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 6.0 failed 1 times, most recent failure: Lost task 1.0 in stage 6.0 (TID 35, 192.168.0.171, executor driver): org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition – yuranos Nov 30 '20 at 13:24
  • Read the link in the answer elsewhere: things work the way they were designed to, not the way we think they may or should. – thebluephantom Nov 30 '20 at 13:26
  • Do I need to use zipWithIndex? – yuranos Nov 30 '20 at 13:46
  • From my findings and from others, yes, as it gives the same distribution over the partitions, thus giving equal numbers of entries per partition, which is a premise for this type of approach (see the sketch below). – thebluephantom Nov 30 '20 at 13:52
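
A minimal sketch of that zipWithIndex approach, assuming the xx and yy from the question (the intermediate variable names are illustrative only): key both RDDs by a generated index and join on it, which avoids zip()'s requirement that both sides have identical per-partition element counts.

xx = sc.parallelize(range(100))
yy = sc.parallelize(list(map(lambda x: x * 2, range(100))))

# zipWithIndex() yields (element, index); flip to (index, element) so the
# index becomes the join key.
xx_by_idx = xx.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
yy_by_idx = yy.zipWithIndex().map(lambda pair: (pair[1], pair[0]))

# Join on the index, restore the original order, then drop the key.
zipped = xx_by_idx.join(yy_by_idx).sortByKey().values()
print(zipped.collect())  # [(0, 0), (1, 2), (2, 4), ...]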

1 Answer


This means both RDDs must have the same number of partitions and the same number of elements in each partition, otherwise the zipping will not work. Here one batch holds 9 items and the other 8, hence the (9, 8) in the error.
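
A minimal sketch of that requirement in isolation (aa and bb are illustrative names, assuming an active SparkContext sc): both RDDs are built from plain lists of equal length with the same explicit number of partitions, so every partition holds the same number of elements on each side and zip() succeeds.

# 100 elements each, explicitly split over 4 partitions (25 per partition),
# so the per-partition element counts match and zip() works.
aa = sc.parallelize(list(range(100)), 4)
bb = sc.parallelize([x * 2 for x in range(100)], 4)
print(aa.zip(bb).collect()[:5])  # [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8)]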

For more info: Unable to write PySpark Dataframe created from two zipped dataframes

rdd.glom().collect() shows that your xx and yy RDDs do not have the same number of elements in each partition. That's the issue.
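
As a sketch of that check, using the xx and yy from the question: glom() turns each partition into a list, so mapping len over it prints the element count per partition for each RDD.

# For zip() to work, these two lists of per-partition counts must match
# position by position.
print(xx.glom().map(len).collect())
print(yy.glom().map(len).collect())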

thebluephantom