
I have Pyspark dataframe:

id | column_1 | column_2 | column_3
---+----------+----------+---------
1  | ["12"]   | null     | ["67"]
2  | null     | ["78"]   | ["90"]
3  | [""]     | ["93"]   | ["56"]
4  | ["100"]  | ["78"]   | ["90"]
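
For reference, a minimal sketch that builds this frame; the array<string> schema and the row-3 value (taken here as an array holding one empty string) are assumptions, not stated in the question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Assumed schema: every value column is array<string>
    df = spark.createDataFrame(
        [
            (1, ["12"], None, ["67"]),
            (2, None, ["78"], ["90"]),
            (3, [""], ["93"], ["56"]),
            (4, ["100"], ["78"], ["90"]),
        ],
        "id int, column_1 array<string>, column_2 array<string>, column_3 array<string>",
    )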

And I need to convert all null values in column_1 to an empty array []:

id | column_1 | column_2 | column_3
---+----------+----------+---------
1  | ["12"]   | null     | ["67"]
2  | []       | ["78"]   | ["90"]
3  | [""]     | ["93"]   | ["56"]
4  | ["100"]  | ["78"]   | ["90"]

I used this code, but it's not working for me:

df.withColumn("column_1", coalesce(column_1, array().cast("array<string>")))

Appreciate your help!


2 Answers


The code works just fine for me, except that you need to wrap column_1 in quotes: "column_1". Plus, you don't need the cast; array() alone is enough.

from pyspark.sql.functions import array, coalesce
df = df.withColumn("column_1", coalesce('column_1', array()))
  • I don't think this will work; coalesce will throw an error saying the inputs to coalesce have different dtypes, array and string (the null). – Strayhorn Aug 03 '23 at 08:53
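
If that type-mismatch error does come up, a hedged sketch that keeps the explicit cast from the question should sidestep it, since both sides of the coalesce then share the array<string> type:

    from pyspark.sql.functions import array, coalesce, col

    # Cast the empty array to array<string> so its type matches column_1
    df = df.withColumn(
        "column_1",
        coalesce(col("column_1"), array().cast("array<string>")),
    )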

Use fillna() with the subset argument.

Reference: https://stackoverflow.com/a/45070181

  • However useful in other situations, apparently `fillna()` **does not accept an empty array**: `TypeError: value should be a float, int, string, bool or dict`. – saza Jun 08 '23 at 23:19
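
Given that limitation, a when/otherwise construction is one way to get the same effect without fillna(); this is a hedged alternative sketch, not part of the original answer:

    from pyspark.sql.functions import array, col, when

    # Replace null with a typed empty array; leave non-null values untouched
    df = df.withColumn(
        "column_1",
        when(col("column_1").isNull(), array().cast("array<string>"))
        .otherwise(col("column_1")),
    )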