I'm new to PySpark and haven't worked with Spark in general for a few years. Can someone explain what is happening here:
import random
import pyspark
sc.stop()
sc = pyspark.SparkContext('local[*]')
xx = sc.parallelize(range(100))
yy = sc.parallelize(list(map(lambda x: x * 2, range(100))))
print(xx.count())
print(yy.count())
zipped = xx.zip(yy)
print(zipped.collect())
Output:
ValueError Traceback (most recent call last)
<ipython-input-57-a532cb7061c7> in <module>
11 print(yy.count())
12 zipped = xx.zip(yy)
---> 13 print(zipped.collect())
...
...
ValueError: Can not deserialize PairRDD with different number of items in batches: (9, 8)
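Both RDDs contain 100 elements, and my understanding from the zip() docs is that it only assumes the two RDDs have the same number of partitions and the same number of elements per partition, so I don't see where the mismatched batch sizes come from.

For what it's worth, below is a sketch of the workaround I'm considering: pairing by an explicit index via zipWithIndex() and join() instead of zip(). I haven't verified it behaves identically, and I'd still like to understand why zip() itself fails.

# Hypothetical workaround: pair elements by an explicit index
# instead of relying on zip()'s partition/batch alignment.
xx_idx = xx.zipWithIndex().map(lambda t: (t[1], t[0]))  # (index, value)
yy_idx = yy.zipWithIndex().map(lambda t: (t[1], t[0]))  # (index, value)

# join on the index, restore the original order, then drop the index
paired = xx_idx.join(yy_idx).sortByKey().map(lambda kv: kv[1])
print(paired.collect())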