How to retrieve all columns using pyspark collect_list functions

Question

I am using PySpark 2.0.1. I'm trying to group my data frame and retrieve the values of all its fields. I found that

z=data1.groupby('country').agg(F.collect_list('names'))

gives me the values for the country and names attributes, with the names column headed collect_list(names). But in my job the data frame has around 15 columns. I will run a loop that changes the groupby field on each iteration, and I need the output for all of the remaining fields. Can you please suggest how to do this using collect_list() or any other PySpark function?

I tried this code too

from pyspark.sql import functions as F

fieldnames = data1.schema.names
names1 = list()
for item in fieldnames:
    if item != 'names':
        names1.append(item)
# passing a Python list to collect_list is what raises the error below
z = data1.groupby('names').agg(F.collect_list(names1))
z.show()

but got error message

Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.collect_list. Trace: py4j.Py4JException: Method collect_list([class java.util.ArrayList]) does not exist


pauli · Answer 1 · 2017-10-19T21:49:56.140

Use struct to combine the columns before calling groupBy.

Suppose you have a DataFrame:

from pyspark.sql import functions as f

df = spark.createDataFrame(sc.parallelize([(0,1,2),(0,4,5),(1,7,8),(1,8,7)])).toDF("a","b","c")

df = df.select("a", f.struct(["b","c"]).alias("newcol"))
df.show()
+---+------+
|  a|newcol|
+---+------+
|  0| [1,2]|
|  0| [4,5]|
|  1| [7,8]|
|  1| [8,7]|
+---+------+
df = df.groupBy("a").agg(f.collect_list("newcol").alias("collected_col"))
df.show()
+---+--------------+
|  a| collected_col|
+---+--------------+
|  0|[[1,2], [4,5]]|
|  1|[[7,8], [8,7]]|
+---+--------------+

An aggregate function such as collect_list accepts only a single column, which is why the columns are combined into a struct first.

After aggregation, you can either collect the result and iterate over it to separate the combined columns and generate the index dict, or write a UDF to separate the combined columns.

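from pyspark.sql.functions import col, udf
from pyspark.sql.types import *

def foo(x):
    # x is a list of (b, c) structs; split it into one list per field
    x1 = [y[0] for y in x]
    x2 = [y[1] for y in x]
    return (x1, x2)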

st = StructType([StructField("b", ArrayType(LongType())), StructField("c", ArrayType(LongType()))])
udf_foo = udf(foo, st)
df = df.withColumn("ncol", 
                  udf_foo("collected_col")).select("a",
                  col("ncol").getItem("b").alias("b"), 
                  col("ncol").getItem("c").alias("c"))
df.show()

+---+------+------+
|  a|     b|     c|
+---+------+------+
|  0|[1, 4]|[2, 5]|
|  1|[7, 8]|[8, 7]|
+---+------+------+

Thanks a lot ashwinids. But my b & c columns should be identified separately along with column a, not combined into collected_col — Python Learner, Oct 18 '17 at 14:44
Actually I'm trying this because of my question mentioned [here](https://stackoverflow.com/questions/46791254/typeerror-groupeddata-object-is-not-iterable-in-pyspark/46791488?noredirect=1#comment80541321_46791488) — Python Learner, Oct 18 '17 at 14:52
Thank you ashwinids. I'm getting error message that StructType is not defined — Python Learner, Oct 19 '17 at 16:35
Add `from pyspark.sql.types import *` line at the top to import datatypes — pauli, Oct 19 '17 at 21:51

score 3 · Answer 2 · answered Jan 25 '19 at 05:23

Actually, we can do it in PySpark 2.2.

First we create a constant column ("Temp"), group by that column, and call agg with the unpacked iterable *exprs, which holds one collect_list expression per column.

Below is the code:

import pyspark.sql.functions as ftions

def groupColumnData(df, columns):
    # constant column so that every row falls into a single group
    df = df.withColumn("Temp", ftions.lit(1))
    # one collect_list expression per column
    exprs = [ftions.collect_list(colName) for colName in columns]
    df = df.groupby('Temp').agg(*exprs)
    df = df.drop("Temp")
    # restore the original column names
    df = df.toDF(*columns)
    return df

Input Data:

df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  0|  1|  2|
|  0|  4|  5|
|  1|  7|  8|
|  1|  8|  7|
+---+---+---+

Output Data:

groupColumnData(df, df.columns).show()
+------------+------------+------------+
|           a|           b|           c|
+------------+------------+------------+
|[0, 0, 1, 1]|[1, 4, 7, 8]|[2, 5, 8, 7]|
+------------+------------+------------+

score 1 · Answer 3 · answered Jul 23 '20 at 11:21

This is for Spark 2.4.4 and Python 3.7 (I guess it is also relevant for earlier Spark and Python versions).
My suggestion is based on pauli's answer: instead of creating the struct and then using the agg function, create the struct inside collect_list:

from pyspark.sql.functions import collect_list, struct

df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy("a").agg(collect_list(struct(["b","c"])).alias("res")).show()

result :

+---+-----------------+
|  a|res              |
+---+-----------------+
|  0|[[1, 2], [4, 5]] |
|  1|[[7, 8], [8, 7]] |
+---+-----------------+

score 0 · Answer 4 · answered Nov 12 '21 at 16:31

I just use the concat_ws function; it works perfectly fine.

from pyspark.sql.functions import *

df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
# alias the aggregated column, not the resulting DataFrame
df.groupBy('a').agg(collect_list(concat_ws(',','b','c')).alias('r')).show()
