
I am trying to use combineByKey to find the median per key for my assignment (using combineByKey is a requirement of the assignment). I'm planning to use the following functions to return (k, v) pairs, where v is a list of all the values associated with the same key. After that, I plan to sort the values and find the median.

data = sc.parallelize([('A',2), ('A',4), ('A',9), ('A',3), ('B',10), ('B',20)])

def median1(c, v):
    list = [c]
    list.append(v)
    return list

def median2(c1, c2):
    list2 = [c1]
    list2.append(c2)
    return list2

rdd = data.combineByKey(lambda value: value, lambda c, v: median1(c, v), lambda c1, c2: median2(c1, c2))

However, my code gives output like this:

[('A', [[2, [4, 9]], 3]), ('B', [10, 20])]

where the value is a nested list. Is there any way I can un-nest the values in PySpark to get

[('A', [2, 4, 9, 3]), ('B', [10, 20])]

Or are there other ways to find the median per key using combineByKey? Thanks!


2 Answers


You just didn't make a proper combiner out of the value: createCombiner has to wrap the first value in a list, so that the merge functions have a list to append to and concatenate.

Here is a working version:

data = sc.parallelize([('A',2), ('A',4), ('A',9), ('A',3), ('B',10), ('B',20)])

def createCombiner(value):
    # Wrap the first value seen for a key in a list.
    return [value]

def mergeValue(c, value):
    # list.append returns None, so append first and return the list itself;
    # "return c.append(value)" would yield None whenever two values for the
    # same key end up in the same partition.
    c.append(value)
    return c

def mergeCombiners(c1, c2):
    # Concatenate the lists built in different partitions.
    return c1 + c2

rdd = data.combineByKey(createCombiner, mergeValue, mergeCombiners)

[('A', [9, 4, 2, 3]), ('B', [10, 20])]
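
From there, the median per key can be taken by sorting each list and picking the middle element, averaging the two middle elements when the count is even. A minimal sketch building on the rdd above; the median_of helper is just an illustrative name, not part of the original answer:

def median_of(values):
    # Sort the collected values; take the middle one, or the mean of the
    # two middle ones when the count is even.
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2.0

medians = rdd.mapValues(median_of)
# medians.collect() -> [('A', 3.5), ('B', 15.0)]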

  • Thanks! I tried it, but when mergeValue returned c.append(value) directly I got [('A', [9, 4, 2, 3]), ('B', None)], and I guess it has something to do with how Spark partitioned the data. I updated the functions to check for empty lists (if c == []: result = value, elif value == []: result = c, else: result = c.append(value), and the same pattern in mergeCombiners), but it still doesn't solve the problem. Do you have any idea how to solve it? Thank you. – Data Science Beginner Jun 11 '18 at 17:27
  • I don't know, it worked for me. Regarding the update you tried to make: each function should return a list, where c is a list and value is an integer here. So at the very least it should be "if c == []: return [value]". – Pierre Gourseaud Jun 11 '18 at 18:26

It's way easier to use collect_list on a DataFrame column.

from pyspark.sql.functions import collect_list

# Build a DataFrame from the original (key, value) pair RDD.
df = data.toDF(['key', 'values'])

key_lists = df.groupBy('key').agg(collect_list('values').alias('value_list'))
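
If you want the median itself rather than the collected list, Spark SQL's percentile aggregate can compute it per key directly. A minimal sketch, assuming a Spark version that ships the exact percentile SQL function; the medians name is just illustrative:

from pyspark.sql import functions as F

# percentile(col, 0.5) is Spark SQL's exact percentile aggregate,
# so 0.5 gives the median of the values for each key.
medians = df.groupBy('key').agg(F.expr('percentile(values, 0.5)').alias('median'))
# medians.collect() -> [Row(key='A', median=3.5), Row(key='B', median=15.0)]
# (row order may vary)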