I want to get a list of a column's values in an aggregate function, in PySpark 1.4. collect_list
is not available there. Does anyone have a suggestion for how to do it?
Original columns:
ID, date, hour, cell
1, 1030, 01, cell1
1, 1030, 01, cell2
2, 1030, 01, cell3
2, 1030, 02, cell4
I want output like below, grouped by (ID, date, hour):
ID, date, hour, cell_list
1, 1030, 01, cell1, cell2
2, 1030, 01, cell3
2, 1030, 02, cell4
But my PySpark version is 1.4.0, where collect_list is not available, so I can't do:
df.groupBy("ID", "date", "hour").agg(collect_list("cell"))
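To be concrete about the result I'm after, here is the same aggregation written in plain Python on the sample rows (the commented RDD lines at the end are only an untested sketch of how I imagine this might map onto the RDD API, not something I've verified on 1.4):

```python
from collections import defaultdict

# Sample rows: (ID, date, hour, cell)
rows = [
    (1, "1030", "01", "cell1"),
    (1, "1030", "01", "cell2"),
    (2, "1030", "01", "cell3"),
    (2, "1030", "02", "cell4"),
]

# Collect the cell values per (ID, date, hour) key -- exactly what
# collect_list would produce after the groupBy.
groups = defaultdict(list)
for id_, date, hour, cell in rows:
    groups[(id_, date, hour)].append(cell)

for key, cells in sorted(groups.items()):
    print(key, cells)
# (1, '1030', '01') ['cell1', 'cell2']
# (2, '1030', '01') ['cell3']
# (2, '1030', '02') ['cell4']

# Untested sketch of the same idea with RDDs (assumed API usage):
# df.rdd.map(lambda r: ((r.ID, r.date, r.hour), [r.cell])) \
#       .reduceByKey(lambda a, b: a + b)
```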