I'm new to Spark. Could somebody explain in more detail the claim that "aggregateByKey() lets you return a result of a different type than the input value type, while reduceByKey() returns a result of the same type as the input"? If I use reduceByKey(), I can also get a different type of value in the output:

>>> rdd = sc.parallelize([(1,3),(2,3),(1,2),(2,5)])
>>> rdd.collect()
[(1, 3), (2, 3), (1, 2), (2, 5)]
>>> rdd.reduceByKey(lambda x,y: str(x)+str(y)).collect()
[(2, '35'), (1, '32')]

As we can see, the input values are `int` and the output is `str`. Or do I not understand this difference correctly? What's the point of it?
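For comparison, here is what I understand the aggregateByKey() version of the same thing to look like (a minimal sketch, assuming the same rdd as above: an empty string as the zero value, one lambda to fold a value into the accumulator, and one to merge accumulators):

>>> rdd.aggregateByKey('', lambda acc, v: acc + str(v), lambda a, b: a + b).collect()
[(2, '35'), (1, '32')]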

next0ne
  • `str(...)` accepts both `int` and `str` as input, which is why this works. Try with a function that returns a string but only accepts an int as input and it won't work. – Stef Sep 18 '22 at 14:12
  • Actually, I just tried `rdd.reduceByKey(lambda x,y:str(x+0)+str(y+0)).collect()` and it produced `[(1, '32'), (2, '35')]` instead of crashing. I need to go meditate on the meaning of life now. – Stef Sep 18 '22 at 14:16
  • [This question](https://stackoverflow.com/questions/24804619/how-does-spark-aggregate-function-aggregatebykey-work) may help clarify things for you. – vilalabinot Sep 18 '22 at 16:48
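A quick sketch expanding on Stef's observation (assuming the same local pyspark shell, with a hypothetical rdd3): reduceByKey() feeds its own results back in as inputs, so the int-only lambda above only avoided crashing because every key has exactly two values, meaning the lambda runs just once per key and never sees its own string output. With three values for a key, the second call receives the string '12' as an argument, and '12' + 0 raises a TypeError:

>>> rdd3 = sc.parallelize([(1, 1), (1, 2), (1, 3)])  # three values for key 1
>>> rdd3.reduceByKey(lambda x, y: str(x + 0) + str(y + 0)).collect()  # fails: one argument is now the string '12'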
