11

I have a dataset with missing values, and I would like to get the number of missing values for each column. The following is what I did; it gives the number of non-missing values. How can I use it to get the number of missing values?

df.describe().filter($"summary" === "count").show
+-------+---+---+---+
|summary|  x|  y|  z|
+-------+---+---+---+
|  count|  1|  2|  3|
+-------+---+---+---+

Any help to get a DataFrame listing each column and its number of missing values would be appreciated.

Maher HTB
  • see also https://stackoverflow.com/questions/41765739/count-the-number-of-non-null-values-in-a-spark-dataframe/41766643#41766643 – Raphael Roth Jun 07 '17 at 18:48

3 Answers

29

You could count the missing values by summing the boolean output of the isNull() method, after converting it to type integer:

In Scala:

import org.apache.spark.sql.functions.{sum, col}
df.select(df.columns.map(c => sum(col(c).isNull.cast("int")).alias(c)): _*).show

In Python:

from pyspark.sql.functions import col, sum
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()

Alternatively, you could use the output of df.describe().filter($"summary" === "count") and subtract each count from the total number of rows in the DataFrame:

In Scala:

import org.apache.spark.sql.functions.{col, lit}

val rows = df.count()
val summary = df.describe().filter($"summary" === "count")
summary.select(df.columns.map(c => (lit(rows) - col(c)).alias(c)): _*).show

In Python:

from pyspark.sql.functions import col, lit

rows = df.count()
summary = df.describe().filter(col("summary") == "count")
summary.select(*((lit(rows)-col(c)).alias(c) for c in df.columns)).show()
mtoto
  • Thanks for your help, it worked. But is there an alternative way that takes less time? It took too much time for large data. – Maher HTB Jun 08 '17 at 08:11
  • @mtoto The describe() function computes all the other statistics as well, hence it can take more time. Use df.summary("count") to limit the computation to counting only; there is no need for df.describe().filter(col("summary") == "count"). – Shirish Kadam Sep 17 '18 at 19:00
  • That's very useful, thanks. How would I add multiple metrics to the output, i.e. a row for isNull and a row for value=X for each column? – irishguy Apr 27 '20 at 15:55
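
A minimal sketch of the df.summary("count") variant suggested in the comment above, assuming Spark 2.3+ (where DataFrame.summary is available) and casting the counts, which come back as strings, before subtracting:

from pyspark.sql.functions import col, lit

rows = df.count()
# summary("count") returns a single row with the per-column non-null counts (as strings)
counts = df.summary("count")
counts.select(*((lit(rows) - col(c).cast("long")).alias(c) for c in df.columns)).show()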
2
from pyspark.sql.functions import isnull, when, count

# when(isnull(c), c) is non-null only for rows where column c is null, and
# count() only counts non-null values, so each aggregate is the number of nulls in c.
nacounts = df.select([count(when(isnull(c), c)).alias(c) for c in df.columns]).toPandas()
nacounts
keikai
  • It would be a more complete answer if you could provide some explanation of what the code is doing and also format the code properly; please see https://stackoverflow.com/editing-help for details on how to format code. – tsega Mar 17 '20 at 06:57
0
# Compute the total row count once, then for each column subtract the rows that
# survive dropping nulls in that column; the difference is its missing-value count.
total = df.count()
for i in df.columns:
    print(i, total - df.na.drop(subset=i).count())
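
If, as in the question, the goal is a DataFrame listing each column and its number of missing values, the same loop can collect the pairs and build a small DataFrame instead of printing them. This is only a sketch assuming spark is the active SparkSession; like the loop above, it still runs one Spark job per column:

total = df.count()
pairs = [(c, total - df.na.drop(subset=c).count()) for c in df.columns]
missing_df = spark.createDataFrame(pairs, ["column", "missing"])
missing_df.show()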