7

Suppose I have a list of columns, for example:

col_list = ['col1','col2']
df = spark.read.json(path_to_file)
print(df.columns)
# ['col1','col2','col3']

I need to create a new column by concatenating col1 and col2. I don't want to hard code the column names while concatenating but need to pick it from the list.

How can I do this?

pault
  • 41,343
  • 15
  • 107
  • 149
Amita Rawat
  • 153
  • 1
  • 2
  • 6

1 Answers1

26

You can use pyspark.sql.functions.concat() to concatenate as many columns as you specify in your list. Keep on passing them as arguments.

from pyspark.sql.functions import concat
# Creating an example DataFrame
values = [('A1',11,'A3','A4'),('B1',22,'B3','B4'),('C1',33,'C3','C4')]
df = sqlContext.createDataFrame(values,['col1','col2','col3','col4'])
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|  A1|  11|  A3|  A4|
|  B1|  22|  B3|  B4|
|  C1|  33|  C3|  C4|
+----+----+----+----+

In the concat() function, you pass all the columns you need to concatenate - like concat('col1','col2'). If you have a list, you can un-list it using *. So (*['col1','col2']) returns ('col1','col2')

col_list = ['col1','col2']
df = df.withColumn('concatenated_cols',concat(*col_list))
df.show()
+----+----+----+----+-----------------+
|col1|col2|col3|col4|concatenated_cols|
+----+----+----+----+-----------------+
|  A1|  11|  A3|  A4|             A111|
|  B1|  22|  B3|  B4|             B122|
|  C1|  33|  C3|  C4|             C133|
+----+----+----+----+-----------------+
cph_sto
  • 7,189
  • 12
  • 42
  • 78
  • can we have a delimiter in between – G.v. Sridhar Oct 07 '20 at 17:13
  • @G.v.Sridhar you can use something like this concat(col("a"), lit(" "), col("b")) and in lit() you can put your delimiter – badger Oct 13 '20 at 10:12
  • 6
    If you want a delimiter, use concat_ws, like this (* is the delimiter): df = df.withColumn('concatenated_cols',concat_ws('*',*col_list)) – Rich Dudley Nov 13 '20 at 18:43
  • when I use concat_ws(",", *col_list), it returns "_1" as col heading instead of "concatenated_cols" and the value becomes Row(concatenated_cols='A1,11,A3,A4) .. how do I fix this ? – Sudhir Jangam Feb 18 '22 at 18:03