
I have Pyspark dataframe:

id | column_1 | column_2 | column_3
---+----------+----------+---------
1  | ["12"]   | null     | ["67"]
2  | null     | ["78"]   | ["90"]
3  | [""]     | ["93"]   | ["56"]
4  | ["100"]  | ["78"]   | ["90"]
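
For reference, a minimal sketch that builds this frame; the array<string> schema and the row-3 value (taken here as an array holding one empty string) are assumptions, not stated in the question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Assumed schema: every value column is array<string>
    df = spark.createDataFrame(
        [
            (1, ["12"], None, ["67"]),
            (2, None, ["78"], ["90"]),
            (3, [""], ["93"], ["56"]),
            (4, ["100"], ["78"], ["90"]),
        ],
        "id int, column_1 array<string>, column_2 array<string>, column_3 array<string>",
    )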

And I need to convert all null values in column_1 to an empty array []:

id | column_1 | column_2 | column_3
---+----------+----------+---------
1  | ["12"]   | null     | ["67"]
2  | []       | ["78"]   | ["90"]
3  | [""]     | ["93"]   | ["56"]
4  | ["100"]  | ["78"]   | ["90"]

I used this code, but it's not working for me:

df.withColumn("column_1", coalesce(column_1, array().cast("array<string>")))

Appreciate your help!


2 Answers


The code works just fine for me, except that you need to wrap column_1 in quotes: "column_1". Plus, you don't need the cast; array() alone is enough.

from pyspark.sql.functions import array, coalesce
df = df.withColumn("column_1", coalesce('column_1', array()))
  • I don't think this will work; coalesce will throw an error saying the inputs to coalesce have different dtypes, array and string (the null). – Strayhorn Aug 03 '23 at 08:53
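
If that type-mismatch error does come up, a hedged sketch that keeps the explicit cast from the question should sidestep it, since both sides of the coalesce then share the array<string> type:

    from pyspark.sql.functions import array, coalesce, col

    # Cast the empty array to array<string> so its type matches column_1
    df = df.withColumn(
        "column_1",
        coalesce(col("column_1"), array().cast("array<string>")),
    )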

Use fillna() with the subset argument.

Reference: https://stackoverflow.com/a/45070181

  • However useful in other situations, apparently `fillna()` **does not accept an empty array**: `TypeError: value should be a float, int, string, bool or dict`. – saza Jun 08 '23 at 23:19
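
Given that limitation, a when/otherwise construction is one way to get the same effect without fillna(); this is a hedged alternative sketch, not part of the original answer:

    from pyspark.sql.functions import array, col, when

    # Replace null with a typed empty array; leave non-null values untouched
    df = df.withColumn(
        "column_1",
        when(col("column_1").isNull(), array().cast("array<string>"))
        .otherwise(col("column_1")),
    )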