
I'm trying to do something that seems fairly straightforward, but I can't figure out how to do it with pyspark.

I have a df with two columns (to simplify), 'id' and 'strCol', with possibly duplicated ids.

I want to do a df.groupBy('id') that would return, for each id, an array of the strCol values.

Simple example:

|--id--|--strCol--|
|   a  |  {'a':1} |
|   a  |  {'a':2} |
|   b  |  {'b':3} |
|   b  |  {'b':4} |
|------|----------|
would become
|--id--|-------aggsStr------|
|   a  |  [{'a':1},{'a':2}] |
|   b  |  [{'b':3},{'b':4}] |
|------|--------------------|

I tried to use apply with a pandas UDF, but it seems to refuse to return arrays (or maybe I didn't use it correctly).

Vincent Chalmel

1 Answer


You can use collect_list from the pyspark.sql.functions module:

from pyspark.sql import functions as F
agg = df.groupby("id").agg(F.collect_list("strCol"))

A fully functional example:

import pandas as pd
from pyspark.sql import functions as F

data = {'id': ['a', 'a', 'b', 'b'], 'strCol': [{'a': 1}, {'a': 2}, {'b': 3}, {'b': 4}]}

df_aux = pd.DataFrame(data)

# df type: DataFrame[id: string, strCol: map<string,bigint>]
df = spark.createDataFrame(df_aux) 


# agg type: DataFrame[id: string, collect_list(strCol): array<map<string,bigint>>]
agg = df.groupby("id").agg(F.collect_list("strCol")) 

Hope this helped!

dataista