11

I have a dataset with missing values, and I would like to get the number of missing values for each column. The following is what I did; it gives the number of non-missing values. How can I use it to get the number of missing values?

df.describe().filter($"summary" === "count").show
+-------+---+---+---+
|summary|  x|  y|  z|
+-------+---+---+---+
|  count|  1|  2|  3|
+-------+---+---+---+

Any help to get a DataFrame listing each column and its number of missing values would be appreciated.

Maher HTB
  • see also https://stackoverflow.com/questions/41765739/count-the-number-of-non-null-values-in-a-spark-dataframe/41766643#41766643 – Raphael Roth Jun 07 '17 at 18:48

3 Answers

29

You could count the missing values by summing the boolean output of the isNull() method, after converting it to type integer:

In Scala:

import org.apache.spark.sql.functions.{sum, col}
df.select(df.columns.map(c => sum(col(c).isNull.cast("int")).alias(c)): _*).show

In Python:

from pyspark.sql.functions import col, sum
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()

Alternatively, you could use the output of df.describe().filter($"summary" === "count") and subtract each count from the total number of rows in the DataFrame:

In Scala:

import org.apache.spark.sql.functions.{col, lit}

val rows = df.count()
val summary = df.describe().filter($"summary" === "count")
summary.select(df.columns.map(c => (lit(rows) - col(c)).alias(c)): _*).show

In Python:

from pyspark.sql.functions import col, lit

rows = df.count()
summary = df.describe().filter(col("summary") == "count")
summary.select(*((lit(rows)-col(c)).alias(c) for c in df.columns)).show()
mtoto
  • Thanks for your help, it worked. But is there an alternative way that takes less time? It took too much time for large data. – Maher HTB Jun 08 '17 at 08:11
  • @mtoto The describe() function computes all the other statistics as well, hence it can take more time. Use df.summary("count") to limit the computation to counting only; there is no need for df.describe().filter(col("summary") == "count"). – Shirish Kadam Sep 17 '18 at 19:00
  • That's very useful, thanks. How would I add multiple metrics to the output, i.e. a row for isNull and a row for value=X for each column? – irishguy Apr 27 '20 at 15:55
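
A minimal sketch of the df.summary("count") variant suggested in the comment above, assuming Spark 2.3+ (where DataFrame.summary is available) and casting the counts, which come back as strings, before subtracting:

from pyspark.sql.functions import col, lit

rows = df.count()
# summary("count") returns a single row with the per-column non-null counts (as strings)
counts = df.summary("count")
counts.select(*((lit(rows) - col(c).cast("long")).alias(c) for c in df.columns)).show()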
2
from pyspark.sql.functions import isnull, when, count

# when(isnull(c), c) is non-null only for rows where column c is null, and
# count() only counts non-null values, so each aggregate is the number of nulls in c.
nacounts = df.select([count(when(isnull(c), c)).alias(c) for c in df.columns]).toPandas()
nacounts
keikai
  • It would be a more complete answer if you could provide some explanation of what the code is doing and also format the code properly; please see https://stackoverflow.com/editing-help for details on how to format code. – tsega Mar 17 '20 at 06:57
0
# Compute the total row count once, then for each column subtract the rows that
# survive dropping nulls in that column; the difference is its missing-value count.
total = df.count()
for i in df.columns:
    print(i, total - df.na.drop(subset=i).count())
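
If, as in the question, the goal is a DataFrame listing each column and its number of missing values, the same loop can collect the pairs and build a small DataFrame instead of printing them. This is only a sketch assuming spark is the active SparkSession; like the loop above, it still runs one Spark job per column:

total = df.count()
pairs = [(c, total - df.na.drop(subset=c).count()) for c in df.columns]
missing_df = spark.createDataFrame(pairs, ["column", "missing"])
missing_df.show()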