I want to get a list of a column's values in an aggregate function, in PySpark 1.4. collect_list
is not available there. Does anyone have a suggestion for how to do it?
Original columns:
ID, date, hour, cell
1, 1030, 01, cell1
1, 1030, 01, cell2
2, 1030, 01, cell3
2, 1030, 02, cell4
I want output like below, grouped by (ID, date, hour):
ID, date, hour, cell_list
1, 1030, 01, cell1, cell2
2, 1030, 01, cell3
2, 1030, 02, cell4
But my PySpark version is 1.4.0, where collect_list is not available, so I can't do:
df.groupBy("ID", "date", "hour").agg(collect_list("cell"))
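To be concrete about the result I'm after, here is the same aggregation written in plain Python on the sample rows (the commented RDD lines at the end are only an untested sketch of how I imagine this might map onto the RDD API, not something I've verified on 1.4):

```python
from collections import defaultdict

# Sample rows: (ID, date, hour, cell)
rows = [
    (1, "1030", "01", "cell1"),
    (1, "1030", "01", "cell2"),
    (2, "1030", "01", "cell3"),
    (2, "1030", "02", "cell4"),
]

# Collect the cell values per (ID, date, hour) key -- exactly what
# collect_list would produce after the groupBy.
groups = defaultdict(list)
for id_, date, hour, cell in rows:
    groups[(id_, date, hour)].append(cell)

for key, cells in sorted(groups.items()):
    print(key, cells)
# (1, '1030', '01') ['cell1', 'cell2']
# (2, '1030', '01') ['cell3']
# (2, '1030', '02') ['cell4']

# Untested sketch of the same idea with RDDs (assumed API usage):
# df.rdd.map(lambda r: ((r.ID, r.date, r.hour), [r.cell])) \
#       .reduceByKey(lambda a, b: a + b)
```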