I'm new to Spark. Could somebody explain in more detail the claim that "aggregateByKey() lets you return a result of a different type than the input value type, while reduceByKey() returns a result of the same type as the input"? If I use reduceByKey(), I can also get a different type of value in the output:

>>> rdd = sc.parallelize([(1,3),(2,3),(1,2),(2,5)])
>>> rdd.collect()
[(1, 3), (2, 3), (1, 2), (2, 5)]
>>> rdd.reduceByKey(lambda x,y: str(x)+str(y)).collect()
[(2, '35'), (1, '32')]

As we can see, the input values are `int` and the output is `str`. Or do I not understand this difference correctly? What's the point of it?
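For comparison, here is what I understand the aggregateByKey() version of the same thing to look like (a minimal sketch, assuming the same rdd as above: an empty string as the zero value, one lambda to fold a value into the accumulator, and one to merge accumulators):

>>> rdd.aggregateByKey('', lambda acc, v: acc + str(v), lambda a, b: a + b).collect()
[(2, '35'), (1, '32')]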

next0ne
  • `str(...)` accepts both `int` and `str` as input, which is why this works. Try with a function that returns a string but only accepts an int as input and it won't work. – Stef Sep 18 '22 at 14:12
  • Actually, I just tried `rdd.reduceByKey(lambda x,y:str(x+0)+str(y+0)).collect()` and it produced `[(1, '32'), (2, '35')]` instead of crashing. I need to go meditate on the meaning of life now. – Stef Sep 18 '22 at 14:16
  • [This question](https://stackoverflow.com/questions/24804619/how-does-spark-aggregate-function-aggregatebykey-work) may help clarify things for you. – vilalabinot Sep 18 '22 at 16:48
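A quick sketch expanding on Stef's observation (assuming the same local pyspark shell, with a hypothetical rdd3): reduceByKey() feeds its own results back in as inputs, so the int-only lambda above only avoided crashing because every key has exactly two values, meaning the lambda runs just once per key and never sees its own string output. With three values for a key, the second call receives the string '12' as an argument, and '12' + 0 raises a TypeError:

>>> rdd3 = sc.parallelize([(1, 1), (1, 2), (1, 3)])  # three values for key 1
>>> rdd3.reduceByKey(lambda x, y: str(x + 0) + str(y + 0)).collect()  # fails: one argument is now the string '12'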
