
I have the dataframe below and I need to convert the empty arrays to null.

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|     
|1111|       []|         []|    
|1112| [45, 46]|   [50, 50]|   
|1113|       []|         []|
+----+---------+-----------+

I have tried the code below, which is not working:

df.na.fill("null").show()

The expected output should be:

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|
|1111|     null|       null|
|1112| [45, 46]|   [50, 50]|
|1113|     null|       null|
+----+---------+-----------+
Alice
    Possible duplicate of [Replace empty strings with None/null values in DataFrame](https://stackoverflow.com/questions/33287886/replace-empty-strings-with-none-null-values-in-dataframe) – Nephilim Jan 03 '18 at 06:37

7 Answers


For your given dataframe, you can simply do the following:

from pyspark.sql import functions as F
df.withColumn("count(AS)", F.when((F.size(F.col("count(AS)")) == 0), F.lit(None)).otherwise(F.col("count(AS)"))) \
    .withColumn("count(asdr)", F.when((F.size(F.col("count(asdr)")) == 0), F.lit(None)).otherwise(F.col("count(asdr)"))).show()

You should get the output dataframe as:

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|
|1111|     null|       null|
|1112| [45, 46]|   [50, 50]|
|1113|     null|       null|
+----+---------+-----------+

Updated

In case you have more than two array columns and want to apply the above logic dynamically, you can use the following:

from pyspark.sql import functions as F
for c in df.dtypes:
    if "array" in c[1]:
        df = df.withColumn(c[0], F.when((F.size(F.col(c[0])) == 0), F.lit(None)).otherwise(F.col(c[0])))
df.show()

Here, df.dtypes gives you a list of tuples of column name and datatype. For the dataframe in the question it would be:

[('id', 'bigint'), ('count(AS)', 'array<bigint>'), ('count(asdr)', 'array<bigint>')]

withColumn is applied only to the array columns ("array" in c[1]). F.size(F.col(c[0])) == 0 is the condition passed to the when function, checking the size of the array: if the condition is true, i.e. the array is empty, None is populated; otherwise the original value is kept. The loop applies this to every array column.
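
As a minimal sketch (assuming the same df as above), the loop can also be collapsed into a single select that builds every column expression in one pass:

from pyspark.sql import functions as F

# one expression per column: null out empty arrays, pass other columns through
df.select(
    *[
        F.when(F.size(F.col(name)) == 0, F.lit(None)).otherwise(F.col(name)).alias(name)
        if "array" in dtype
        else F.col(name)
        for name, dtype in df.dtypes
    ]
).show()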

Ramesh Maharjan

I don't think that's possible with na.fill, but this should work for you. The code converts all empty ArrayType columns to null and keeps the other columns as they are:

import spark.implicits._
import org.apache.spark.sql.types.ArrayType
import org.apache.spark.sql.functions._

val df = Seq(
  (110, Seq.empty[Int]),
  (111, Seq(1,2,3))
).toDF("id","arr")

// get names of array-type columns
val arrColsNames = df.schema.fields.filter(f => f.dataType.isInstanceOf[ArrayType]).map(_.name)

// map all empty arrays to nulls
val emptyArraysAsNulls = arrColsNames.map(n => when(size(col(n))>0,col(n)).as(n))

// non-array-type columns, keep them as they are
val keepCols = df.columns.filterNot(arrColsNames.contains).map(col)

df
  .select((keepCols ++ emptyArraysAsNulls):_*)
  .show()

+---+---------+
| id|      arr|
+---+---------+
|110|     null|
|111|[1, 2, 3]|
+---+---------+
Raphael Roth
  • I am using the same functions in pyspark and this is the error I am getting: `arrColsNames = df1.schema.fields.filter(lambda x: x.dataType.isInstanceOf[ArrayType]).map(_.name)` AttributeError: 'list' object has no attribute 'filter' – Alice Jan 03 '18 at 09:25
  • @Alice Sorry, I missed that you are working with Python; my solution works for Scala. Maybe this could work in Python: `df1.schema.fields.filter(lambda x: x.dataType.typeName == "array").map(_.name)`? – Raphael Roth Jan 03 '18 at 09:29
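
Neither of the snippets in these comments is valid Python. A minimal PySpark sketch of the same idea (assuming the question's df) could look like this:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

# names of the ArrayType columns (PySpark analogue of the Scala filter above)
arrColsNames = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]

# keep non-array columns as they are, map empty arrays to null
df.select(
    *[F.col(n) for n in df.columns if n not in arrColsNames],
    *[F.when(F.size(F.col(n)) > 0, F.col(n)).alias(n) for n in arrColsNames],
).show()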

You need to check for the size of the array type column. Like:

df.show()
+----+---+
|  id|arr|
+----+---+
|1110| []|
+----+---+

df.withColumn("arr", when(size(col("arr")) == 0 , lit(None)).otherwise(col("arr") ) ).show()

+----+----+
|  id| arr|
+----+----+
|1110|null|
+----+----+
philantrovert
  • Thanks for your input. In your code you have taken the "arr" column, but here I may have many columns which contain empty arrays, so in actual production hard-coding is not a best practice, right? Don't we have a function that converts all empty arrays to null, like df.na.fill(), which is not working in this case? – Alice Jan 03 '18 at 07:08

There is no easy solution like df.na.fill here. One way would be to loop over all relevant columns and replace values where appropriate. Example using foldLeft in Scala:

val columns = df.schema.filter(_.dataType.typeName == "array").map(_.name)

val df2 = columns.foldLeft(df)((acc, colname) => acc.withColumn(colname, 
    when(size(col(colname)) === 0, null).otherwise(col(colname))))

First, all columns of array type are extracted and then iterated through. Since the size function is only defined for columns of array type, this is a necessary step (and it avoids looping over all columns).

Using the dataframe:

+----+--------+-----+
|  id|    col1| col2|
+----+--------+-----+
|1110|[12, 11]|   []|
|1111|      []| [11]|
|1112|   [123]|[321]|
+----+--------+-----+

The result is as follows:

+----+--------+-----+
|  id|    col1| col2|
+----+--------+-----+
|1110|[12, 11]| null|
|1111|    null| [11]|
|1112|   [123]|[321]|
+----+--------+-----+
Shaido
  • This is the error I am getting: `arrColsNames = df1.schema.fields.filter(lambda x: x.dataType.isInstanceOf[ArrayType]).map(_.name)` AttributeError: 'list' object has no attribute 'filter' – Alice Jan 03 '18 at 09:22
  • @Alice: In Python it should look something like `arrColsNames = [c.name for c in df1.schema.fields if c.dataType.isInstanceOf(ArrayType)]`. I can't try it out myself right now but hopefully it will help you. – Shaido Jan 03 '18 at 09:30
  • ERROR: `arrColsNames = [c.name for c in df1.schema.fields if c.dataType.isInstanceOf(ArrayType)]` AttributeError: 'StringType' object has no attribute 'isInstanceOf' – Alice Jan 03 '18 at 09:54
  • @Alice: Try `arrColsNames = [c.name for c in df1.schema if c.dataType == 'array']`. Should work assuming the same names as in Scala. – Shaido Jan 03 '18 at 09:59
  • Can you please look into this syntax as well in pyspark: `val df2 = columns.foldLeft(df)((acc, colname) => acc.withColumn(colname, when(size(col(colname)) === 0, null).otherwise(col(colname))))` – Alice Jan 03 '18 at 10:18
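
For the foldLeft line asked about in the last comment, functools.reduce plays the same role in Python; a minimal sketch, assuming the question's df:

from functools import reduce
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

arrayCols = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]

# reduce threads the dataframe through withColumn once per array column,
# just as foldLeft threads the accumulator in the Scala version
df2 = reduce(
    lambda acc, c: acc.withColumn(c, F.when(F.size(F.col(c)) == 0, F.lit(None)).otherwise(F.col(c))),
    arrayCols,
    df,
)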

You can do it with selectExpr:

df_filled = df.selectExpr(
    "id",
    "if(size(column1)<=0, null, column1)",
    "if(size(column2)<=0, null, column2)",
    ...
)
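
If there are many columns, the same expressions can be generated rather than typed out one by one; a sketch assuming df.dtypes as shown in the first answer (the backticks are needed because SQL expressions must quote names like count(AS), and the AS alias keeps the original column names):

exprs = [
    f"if(size(`{name}`) <= 0, null, `{name}`) AS `{name}`" if "array" in dtype else f"`{name}`"
    for name, dtype in df.dtypes
]
df_filled = df.selectExpr(*exprs)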
FooBee
df.withColumn("arr", when(size(col("arr")) == 0, lit(None)).otherwise(col("arr") ) ).show()

Please keep in mind, this also works in PySpark.

prasad

Taking Ramesh Maharjan's solution above as a reference, I have found another way to solve this using a UDF. I hope this helps you apply multiple rules on your dataframe.

df

+-----+----+----+----+
|store|   1|   2|   3|
+-----+----+----+----+
|  103|[90]|  []|  []|
|  104|  []|[67]|[90]|
|  101|[34]|  []|  []|
|  102|[35]|  []|  []|
+-----+----+----+----+

Use the code below, which needs import pyspark.sql.functions as psf. This code works in PySpark:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType
import pyspark.sql.functions as psf

def udf1(x: list):
    if x == []:
        return None  # empty array becomes null
    else:
        return x

udf2 = udf(udf1, ArrayType(IntegerType()))

# apply the UDF to every array-type column
for c in df.dtypes:
    if "array" in c[1]:
        df = df.withColumn(c[0], udf2(psf.col(c[0])))
df.show()

Output:

+-----+----+----+----+
|store|   1|   2|   3|
+-----+----+----+----+
|  103|[90]|null|null|
|  104|null|[67]|[90]|
|  101|[34]|null|null|
|  102|[35]|null|null|
+-----+----+----+----+
LUZO
  • Thanks, working perfectly fine. Yes, this helps me to apply multiple rules on my dataframe. – Alice Jan 04 '18 at 18:37