I am trying to use combineByKey to find the median per key for my assignment (using combineByKey is a requirement of the assignment). My plan is to use the following functions to build (k, v) pairs where v is a list of all values associated with the same key, and then sort each list and find its median.
data = sc.parallelize([('A', 2), ('A', 4), ('A', 9), ('A', 3), ('B', 10), ('B', 20)])

def median1(c, v):
    list = [c]
    list.append(v)
    return list

def median2(c1, c2):
    list2 = [c1]
    list2.append(c2)
    return list2

rdd = data.combineByKey(lambda value: value, lambda c, v: median1(c, v), lambda c1, c2: median2(c1, c2))
However, my code gives output like this:
[('A', [[2, [4, 9]], 3]), ('B', [10, 20])]
where the value is a nested list. Is there any way I can flatten the values in PySpark to get
[('A', [2, 4, 9, 3]), ('B', [10, 20])]
Or are there other ways I can find the median per key using combineByKey? Thanks!
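For reference, here is a pure-Python sketch of the list-building combiner logic I am aiming for. The three functions are in the shape combineByKey expects (createCombiner, mergeValue, mergeCombiners); the simulation function is only my own stand-in so the logic can be checked without a SparkContext, and all names here are my own, not part of the assignment:

```python
from statistics import median

# Combiner functions in the shape combineByKey expects.
# In PySpark these would be passed as:
#   rdd = data.combineByKey(create_combiner, merge_value, merge_combiners)
def create_combiner(value):
    return [value]          # start a new list, not a bare value

def merge_value(acc, value):
    acc.append(value)       # add one more value to an existing list
    return acc

def merge_combiners(acc1, acc2):
    acc1.extend(acc2)       # concatenate two partial lists
    return acc1

# Stand-in for how Spark applies the functions within one partition,
# just to sanity-check the logic locally.
def simulate_combine_by_key(pairs):
    combined = {}
    for k, v in pairs:
        if k not in combined:
            combined[k] = create_combiner(v)
        else:
            combined[k] = merge_value(combined[k], v)
    return combined

data = [('A', 2), ('A', 4), ('A', 9), ('A', 3), ('B', 10), ('B', 20)]
per_key = simulate_combine_by_key(data)
medians = {k: median(sorted(vs)) for k, vs in per_key.items()}
# per_key -> {'A': [2, 4, 9, 3], 'B': [10, 20]}
# medians -> {'A': 3.5, 'B': 15.0}
```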