
I'm new to PySpark and haven't worked with Spark in general for a few years. Can someone explain what happens here:

import random
import pyspark

sc.stop()
sc = pyspark.SparkContext('local[*]')


xx = sc.parallelize(range(100))
yy = sc.parallelize(list(map(lambda x : x *2, range(100))))
print(xx.count())
print(yy.count())
zipped = xx.zip(yy)
print(zipped.collect())

Output:

ValueError                             Traceback (most recent call last)
<ipython-input-57-a532cb7061c7> in <module>
     11 print(yy.count())
     12 zipped = xx.zip(yy)
---> 13 print(zipped.collect())
...  
...
ValueError: Can not deserialize PairRDD with different number of items in batches: (9, 8)
yuranos
  • Pretty sure the answer is correct. – thebluephantom Nov 23 '20 at 10:23
  • Did you disprove me? – thebluephantom Nov 24 '20 at 19:45
  • No, sorry, @thebluephantom, I work on that course on weekends. I will check it closer to the end of the week and hopefully confirm that all is correct. – yuranos Nov 24 '20 at 21:15
  • Look at the example in the link. Success. – thebluephantom Nov 24 '20 at 21:22
  • If I just change the xx and yy definitions to xx = sc.parallelize(range(100)).repartition(5) and yy = sc.parallelize(list(map(lambda x : x *2, range(100)))).repartition(5), it doesn't help; I still get: Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 6.0 failed 1 times, most recent failure: Lost task 1.0 in stage 6.0 (TID 35, 192.168.0.171, executor driver): org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition – yuranos Nov 30 '20 at 13:24
  • Read the link in the answer elsewhere: things work the way they were designed to, not the way we think they may or should. – thebluephantom Nov 30 '20 at 13:26
  • Do I need to use zipWithIndex? – yuranos Nov 30 '20 at 13:46
  • From my findings and from others, yes, as it gives the same distribution over the partitions, thus giving equal numbers of entries per partition, which is a premise for this type of approach (see the sketch below). – thebluephantom Nov 30 '20 at 13:52
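
A minimal sketch of that zipWithIndex approach, assuming the xx and yy from the question (the intermediate variable names are illustrative only): key both RDDs by a generated index and join on it, which avoids zip()'s requirement that both sides have identical per-partition element counts.

xx = sc.parallelize(range(100))
yy = sc.parallelize(list(map(lambda x: x * 2, range(100))))

# zipWithIndex() yields (element, index); flip to (index, element) so the
# index becomes the join key.
xx_by_idx = xx.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
yy_by_idx = yy.zipWithIndex().map(lambda pair: (pair[1], pair[0]))

# Join on the index, restore the original order, then drop the key.
zipped = xx_by_idx.join(yy_by_idx).sortByKey().values()
print(zipped.collect())  # [(0, 0), (1, 2), (2, 4), ...]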

1 Answer


This means both RDDs must have the same number of partitions and the same number of elements in each partition, otherwise the zipping will not work. Here one batch holds 9 items and the other 8, hence the (9, 8) in the error.
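
A minimal sketch of that requirement in isolation (aa and bb are illustrative names, assuming an active SparkContext sc): both RDDs are built from plain lists of equal length with the same explicit number of partitions, so every partition holds the same number of elements on each side and zip() succeeds.

# 100 elements each, explicitly split over 4 partitions (25 per partition),
# so the per-partition element counts match and zip() works.
aa = sc.parallelize(list(range(100)), 4)
bb = sc.parallelize([x * 2 for x in range(100)], 4)
print(aa.zip(bb).collect()[:5])  # [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8)]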

For more info: Unable to write PySpark Dataframe created from two zipped dataframes

rdd.glom().collect() shows that your xx and yy RDDs do not have the same number of elements in each partition. That's the issue.
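
As a sketch of that check, using the xx and yy from the question: glom() turns each partition into a list, so mapping len over it prints the element count per partition for each RDD.

# For zip() to work, these two lists of per-partition counts must match
# position by position.
print(xx.glom().map(len).collect())
print(yy.glom().map(len).collect())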

thebluephantom