7

I have a data frame with some columns, and before doing any analysis, I'd like to understand how complete the data frame is. So I want to count, for each column, the number of non-null values, possibly returning the result as a dataframe.

Basically, I am trying to achieve the same result as expressed in this question but using Scala instead of Python.

Say you have:

val df = Seq[(Option[Int], Option[Int], Option[Int])](
  (Some(0), Some(4), Some(3)), (None, Some(3), Some(4)), (None, None, Some(5))
).toDF("x", "y", "z")

How can you summarize the number of non-null values for each column and return a dataframe with the same number of columns and just a single row with the answer?

user299791
  • thanks very much, I'd appreciate it if you could share some snippets to better understand the logic you propose: I can write a UDF to do this, but I don't see how to write code that will execute the UDF for each column – user299791 Jan 20 '17 at 14:27

6 Answers

14

One straightforward option is to use the .describe() function to get a summary of your data frame, where the count row gives the number of non-null values per column:

df.describe().filter($"summary" === "count").show
+-------+---+---+---+
|summary|  x|  y|  z|
+-------+---+---+---+
|  count|  1|  2|  3|
+-------+---+---+---+
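
If you need the result as a dataframe with the same columns and a single row, as the question asks, a minimal follow-up sketch (using the same df) is to keep only the count row and drop the summary column:

df.describe()
  .filter($"summary" === "count")
  .drop("summary")
  .show()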
Psidom
  • this works only for numeric columns, not for String... right? – user299791 Jan 20 '17 at 15:05
  • This works for string columns with Spark 2.2.1: Load with a schema so that we get a string column filled with nothing but nulls: `val xx = spark.read.schema(StructType(Seq(StructField("a", LongType, true), StructField("b", LongType, true), StructField("c", StringType, true), StructField("d", StringType, true)))).json("/tmp/toy.jline")` Then describe. – Max Murphy Apr 09 '18 at 08:48
9

Although I like Psidom's answer, I'm often more interested in the fraction of null values, because the number of non-null values alone doesn't tell you much.

You can do something like:

import org.apache.spark.sql.functions.{sum, when, count}

df.agg(
  (sum(when($"x".isNotNull, 0).otherwise(1)) / count("*")).as("x : fraction null"),
  (sum(when($"y".isNotNull, 0).otherwise(1)) / count("*")).as("y : fraction null"),
  (sum(when($"z".isNotNull, 0).otherwise(1)) / count("*")).as("z : fraction null")
).show()

EDIT: note that count($"x") counts only non-null values, so sum(when($"x".isNotNull, 1).otherwise(0)) can simply be replaced by count($"x"), and the fraction of nulls above equals 1 - count($"x")/count("*"). As I find this less obvious, I tend to use the explicit sum notation, which is clearer.
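
A minimal sketch of that count-based variant (same df as above; lit is also needed from the functions package) might be:

import org.apache.spark.sql.functions.{count, lit}

df.agg(
  (lit(1) - count($"x") / count("*")).as("x : fraction null"),
  (lit(1) - count($"y") / count("*")).as("y : fraction null"),
  (lit(1) - count($"z") / count("*")).as("z : fraction null")
).show()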

Raphael Roth
1

Here's how I did it in Scala 2.11, Spark 2.3.1:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

df.agg(
    count("x").divide(count(lit(1)))
        .as("x: percent non-null")
    // ...copy paste that for columns y and z
).head()

count("x") counts only the rows where x is non-null, while count(lit(1)) counts every row.

If you instead want the percentage of nulls, take the complement of the count-based expression above:

lit(1).minus(count("x").divide(count(lit(1))))
    .as("x: percent null")

It's also worth knowing that you can cast nullness to an integer, then sum it.
But it's probably less performant:

// cast null-ness to an integer
sum(col("x").isNull.cast(IntegerType))
    .divide(count(lit(1)))
    .as("x: percent null")
Birchlabs
0

Here is the simplest query, for a single column (note that $"x" !== null does not work here, because comparisons with null never evaluate to true; use isNotNull instead):

df.filter($"x".isNotNull).count
KayV
0
import org.apache.spark.sql.functions.count

df.select(df.columns map count: _*)

or, keeping the original column names:

df.select(df.columns map count: _*).toDF(df.columns: _*)
ZygD
0

Spark 2.3+
(for string and numeric type columns)

df.summary("count").show()
+-------+---+---+---+
|summary|  x|  y|  z|
+-------+---+---+---+
|  count|  1|  2|  3|
+-------+---+---+---+
ZygD