
I have the dataframe below and I need to convert the empty arrays to null.

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|     
|1111|       []|         []|    
|1112| [45, 46]|   [50, 50]|   
|1113|       []|         []|
+----+---------+-----------+

I have tried the code below, which is not working:

df.na.fill("null").show()

The expected output should be:

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|
|1111|     null|       null|
|1112| [45, 46]|   [50, 50]|
|1113|     null|       null|
+----+---------+-----------+
Alice
    Possible duplicate of [Replace empty strings with None/null values in DataFrame](https://stackoverflow.com/questions/33287886/replace-empty-strings-with-none-null-values-in-dataframe) – Nephilim Jan 03 '18 at 06:37

7 Answers


For your given dataframe, you can simply do the following:

from pyspark.sql import functions as F
df.withColumn("count(AS)", F.when((F.size(F.col("count(AS)")) == 0), F.lit(None)).otherwise(F.col("count(AS)"))) \
    .withColumn("count(asdr)", F.when((F.size(F.col("count(asdr)")) == 0), F.lit(None)).otherwise(F.col("count(asdr)"))).show()

You should get the output dataframe as:

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|
|1111|     null|       null|
|1112| [45, 46]|   [50, 50]|
|1113|     null|       null|
+----+---------+-----------+

Updated

In case you have more than two array columns and want to apply the above logic dynamically, you can use the following:

from pyspark.sql import functions as F
for c in df.dtypes:
    if "array" in c[1]:
        df = df.withColumn(c[0], F.when((F.size(F.col(c[0])) == 0), F.lit(None)).otherwise(F.col(c[0])))
df.show()

Here, df.dtypes gives you a list of tuples of column name and datatype. For the dataframe in the question it would be:

[('id', 'bigint'), ('count(AS)', 'array<bigint>'), ('count(asdr)', 'array<bigint>')]

withColumn is applied only to the array columns ("array" in c[1]). F.size(F.col(c[0])) == 0 is the condition passed to the when function, checking the size of the array: if the condition is true, i.e. the array is empty, None is populated; otherwise the original value is kept. The loop applies this to every array column.
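
As a minimal sketch (assuming the same df as above), the loop can also be collapsed into a single select that builds every column expression in one pass:

from pyspark.sql import functions as F

# one expression per column: null out empty arrays, pass other columns through
df.select(
    *[
        F.when(F.size(F.col(name)) == 0, F.lit(None)).otherwise(F.col(name)).alias(name)
        if "array" in dtype
        else F.col(name)
        for name, dtype in df.dtypes
    ]
).show()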

Ramesh Maharjan

I don't think that's possible with na.fill, but this should work for you. The code converts all empty ArrayType columns to null and keeps the other columns as they are:

import spark.implicits._
import org.apache.spark.sql.types.ArrayType
import org.apache.spark.sql.functions._

val df = Seq(
  (110, Seq.empty[Int]),
  (111, Seq(1,2,3))
).toDF("id","arr")

// get names of array-type columns
val arrColsNames = df.schema.fields.filter(f => f.dataType.isInstanceOf[ArrayType]).map(_.name)

// map all empty arrays to nulls
val emptyArraysAsNulls = arrColsNames.map(n => when(size(col(n))>0,col(n)).as(n))

// non-array-type columns, keep them as they are
val keepCols = df.columns.filterNot(arrColsNames.contains).map(col)

df
  .select((keepCols ++ emptyArraysAsNulls):_*)
  .show()

+---+---------+
| id|      arr|
+---+---------+
|110|     null|
|111|[1, 2, 3]|
+---+---------+
Raphael Roth
  • I am using the same functions in pyspark and this is the error I am getting: `arrColsNames = df1.schema.fields.filter(lambda x: x.dataType.isInstanceOf[ArrayType]).map(_.name)` AttributeError: 'list' object has no attribute 'filter' – Alice Jan 03 '18 at 09:25
  • @Alice Sorry, I missed that you are working with Python; my solution works for Scala. Maybe this could work in Python: `df1.schema.fields.filter(lambda x: x.dataType.typeName == "array").map(_.name)`? – Raphael Roth Jan 03 '18 at 09:29
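
Neither of the snippets in these comments is valid Python. A minimal PySpark sketch of the same idea (assuming the question's df) could look like this:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

# names of the ArrayType columns (PySpark analogue of the Scala filter above)
arrColsNames = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]

# keep non-array columns as they are, map empty arrays to null
df.select(
    *[F.col(n) for n in df.columns if n not in arrColsNames],
    *[F.when(F.size(F.col(n)) > 0, F.col(n)).alias(n) for n in arrColsNames],
).show()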

You need to check for the size of the array type column. Like:

df.show()
+----+---+
|  id|arr|
+----+---+
|1110| []|
+----+---+

df.withColumn("arr", when(size(col("arr")) == 0 , lit(None)).otherwise(col("arr") ) ).show()

+----+----+
|  id| arr|
+----+----+
|1110|null|
+----+----+
philantrovert
  • Thanks for your input. In your code you have taken the "arr" column, but here I may have many columns which contain empty arrays, so in actual production hard-coding is not a best practice, right? Don't we have a function that converts all empty arrays to null, like df.na.fill(), which is not working in this case? – Alice Jan 03 '18 at 07:08

There is no easy solution like df.na.fill here. One way would be to loop over all relevant columns and replace values where appropriate. Example using foldLeft in Scala:

val columns = df.schema.filter(_.dataType.typeName == "array").map(_.name)

val df2 = columns.foldLeft(df)((acc, colname) => acc.withColumn(colname, 
    when(size(col(colname)) === 0, null).otherwise(col(colname))))

First, all columns of array type are extracted and then iterated through. Since the size function is only defined for columns of array type, this is a necessary step (and it avoids looping over all columns).

Using the dataframe:

+----+--------+-----+
|  id|    col1| col2|
+----+--------+-----+
|1110|[12, 11]|   []|
|1111|      []| [11]|
|1112|   [123]|[321]|
+----+--------+-----+

The result is as follows:

+----+--------+-----+
|  id|    col1| col2|
+----+--------+-----+
|1110|[12, 11]| null|
|1111|    null| [11]|
|1112|   [123]|[321]|
+----+--------+-----+
Shaido
  • This is the error I am getting: `arrColsNames = df1.schema.fields.filter(lambda x: x.dataType.isInstanceOf[ArrayType]).map(_.name)` AttributeError: 'list' object has no attribute 'filter' – Alice Jan 03 '18 at 09:22
  • @Alice: In Python it should look something like `arrColsNames = [c.name for c in df1.schema.fields if c.dataType.isInstanceOf(ArrayType)]`. I can't try it out myself right now but hopefully it will help you. – Shaido Jan 03 '18 at 09:30
  • ERROR: `arrColsNames = [c.name for c in df1.schema.fields if c.dataType.isInstanceOf(ArrayType)]` AttributeError: 'StringType' object has no attribute 'isInstanceOf' – Alice Jan 03 '18 at 09:54
  • @Alice: Try `arrColsNames = [c.name for c in df1.schema if c.dataType == 'array']`. Should work assuming the same names as in Scala. – Shaido Jan 03 '18 at 09:59
  • Can you please look into this syntax as well in pyspark: `val df2 = columns.foldLeft(df)((acc, colname) => acc.withColumn(colname, when(size(col(colname)) === 0, null).otherwise(col(colname))))` – Alice Jan 03 '18 at 10:18
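
For the foldLeft line asked about in the last comment, functools.reduce plays the same role in Python; a minimal sketch, assuming the question's df:

from functools import reduce
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

arrayCols = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]

# reduce threads the dataframe through withColumn once per array column,
# just as foldLeft threads the accumulator in the Scala version
df2 = reduce(
    lambda acc, c: acc.withColumn(c, F.when(F.size(F.col(c)) == 0, F.lit(None)).otherwise(F.col(c))),
    arrayCols,
    df,
)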

You can do it with selectExpr:

df_filled = df.selectExpr(
    "id",
    "if(size(column1)<=0, null, column1)",
    "if(size(column2)<=0, null, column2)",
    ...
)
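
If there are many columns, the same expressions can be generated rather than typed out one by one; a sketch assuming df.dtypes as shown in the first answer (the backticks are needed because SQL expressions must quote names like count(AS), and the AS alias keeps the original column names):

exprs = [
    f"if(size(`{name}`) <= 0, null, `{name}`) AS `{name}`" if "array" in dtype else f"`{name}`"
    for name, dtype in df.dtypes
]
df_filled = df.selectExpr(*exprs)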
FooBee
df.withColumn("arr", when(size(col("arr")) == 0, lit(None)).otherwise(col("arr") ) ).show()

Please keep in mind, this also works in PySpark.

prasad

Taking Ramesh Maharjan's solution above as a reference, I have found another way to solve this using a UDF. I hope this helps you apply multiple rules on your dataframe.

df

+-----+----+----+----+
|store|   1|   2|   3|
+-----+----+----+----+
|  103|[90]|  []|  []|
|  104|  []|[67]|[90]|
|  101|[34]|  []|  []|
|  102|[35]|  []|  []|
+-----+----+----+----+

Use the code below, which needs import pyspark.sql.functions as psf. This code works in PySpark:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType
import pyspark.sql.functions as psf

def udf1(x: list):
    if x == []:
        return None  # empty array becomes null
    else:
        return x

udf2 = udf(udf1, ArrayType(IntegerType()))

# apply the UDF to every array-type column
for c in df.dtypes:
    if "array" in c[1]:
        df = df.withColumn(c[0], udf2(psf.col(c[0])))
df.show()

Output:

+-----+----+----+----+
|store|   1|   2|   3|
+-----+----+----+----+
|  103|[90]|null|null|
|  104|null|[67]|[90]|
|  101|[34]|null|null|
|  102|[35]|null|null|
+-----+----+----+----+
LUZO
  • Thanks, working perfectly fine. Yes, this helps me to apply multiple rules on my dataframe. – Alice Jan 04 '18 at 18:37