
I'm trying to count the empty values in a column of a DataFrame like this:

df.filter((df(colname) === null) || (df(colname) === "")).count()

colname holds the name of the column. This works fine if the column type is string, but if the column type is integer and there are some nulls, this code always returns 0. Why is that, and how can I change it so that it works?

blackbishop
sergeda
  • In this thread you'll find more extensive answers; just add .count(): https://stackoverflow.com/questions/39727742/how-to-filter-out-a-null-value-from-spark-dataframe – Steffen Schmitz Jun 02 '17 at 14:13

2 Answers


As mentioned in the question, df.filter((df(colname) === null) || (df(colname) === "")).count() works for String columns, but testing shows that nulls are not handled: in Spark SQL, comparing a column with null using === evaluates to null rather than true, and filter drops every row whose predicate is not true.
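Here is a minimal spark-shell sketch of that behavior (the column name B is illustrative):

import spark.implicits._  // already in scope in spark-shell; needed for toDF

val df = Seq(Some(1), None).toDF("B")

df.filter(df("B") === null).count()  // 0: the comparison evaluates to null, so the row is dropped
df.filter(df("B").isNull).count()    // 1: isNull matches the null row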

@Psidom's answer handles both null and empty, but it does not handle NaN.

Adding a check with .isNaN should handle all three cases:

df.filter(df(colName).isNull || df(colName) === "" || df(colName).isNaN).count()
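For instance (a spark-shell sketch; the double column C is illustrative, since NaN only occurs in floating-point columns):

import spark.implicits._  // already in scope in spark-shell

val df = Seq(Some(1.0), None, Some(Double.NaN)).toDF("C")

df.filter(df("C").isNull || df("C") === "").count()                    // 1: only the null row
df.filter(df("C").isNull || df("C") === "" || df("C").isNaN).count()   // 2: the null and NaN rows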
Ramesh Maharjan

You can use isNull to test the null condition:

import spark.implicits._  // already in scope in spark-shell; needed for toDF
val df = Seq((Some("a"), Some(1)), (null, null), (Some(""), Some(2))).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: string, B: int]

df.filter(df("A").isNull || df("A") === "").count
// res7: Long = 2

df.filter(df("B").isNull || df("B") === "").count
// res8: Long = 1
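Note that the count for the integer column B is 1: casting "" to a numeric type yields null, so === "" can never match there, and only isNull fires. If you want the counts for several columns in one pass, a conditional aggregation is another option; here is a sketch using when and count from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.{count, when}

df.select(
  count(when(df("A").isNull || df("A") === "", 1)).as("empty_A"),
  count(when(df("B").isNull || df("B") === "", 1)).as("empty_B")
).show()
// +-------+-------+
// |empty_A|empty_B|
// +-------+-------+
// |      2|      1|
// +-------+-------+

when without an otherwise yields null when the condition is not true, and count skips nulls, so each aggregate counts exactly the matching rows.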
Psidom