I want to count the number of nulls in a DataFrame by row.

Please note there are 50+ columns. I know I could do this with a case/when statement, but I would prefer a neater solution.

For example, a subset:

columns = ['id', 'item1', 'item2', 'item3']
vals = [(1, 2, 'A', None), (2, None, '1', None), (3, None, '9', 'C')]
df = spark.createDataFrame(vals, columns)
df.show()

+---+-----+-----+-----+
| id|item1|item2|item3|
+---+-----+-----+-----+
|  1|    2|    A| null|
|  2| null|    1| null|
|  3| null|    9|    C|
+---+-----+-----+-----+

After running the code, the desired output is:

+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    A| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    C|       1|
+---+-----+-----+-----+--------+

EDIT: Not all non-null values are ints.

2 Answers

21

Convert each null to 1 and each non-null to 0, then sum across all the columns:

df.withColumn('numNulls', sum(df[col].isNull().cast('int') for col in df.columns)).show()

+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    A| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    C|       1|
+---+-----+-----+-----+--------+
Psidom
  • The values are not actually always ints, I have updated the question to reflect that. – tormond Oct 17 '18 at 23:14
  • The answer doesn't assume ints. It checks for null generally: if a value is null it becomes 1, otherwise 0, and then the columns are summed. – Psidom Oct 17 '18 at 23:17
  • Works perfect. Thanks. – tormond Oct 18 '18 at 18:09
  • I get the error `TypeError: 'Column' object is not callable` for the same example – Ali Apr 11 '19 at 01:23
  • So I was using [pyspark.sql.functions.sum](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum) instead of Python's built-in [sum](https://docs.python.org/2/library/functions.html#sum), which caused the problem for me. More about the difference [here](https://stackoverflow.com/questions/47690615/whats-is-the-correct-way-to-sum-different-dataframe-columns-in-a-list-in-pyspark) – Ali Apr 11 '19 at 05:47
  • For PySpark 3.3.1, you need to use `lit` for the sum in order for this to work: `df.withColumn('numNulls', lit(sum(df[col].isNull().cast('int') for col in df.columns)))` – Quan Bui Nov 30 '22 at 10:47
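
As the comments above explain, this snippet relies on Python's built-in sum; if pyspark.sql.functions.sum has shadowed it (for example after from pyspark.sql.functions import *), one way to sidestep the name clash is to fold the null indicators with functools.reduce instead. A minimal sketch, assuming the same df as in the question:

import operator
from functools import reduce

# 1 where a value is null, 0 otherwise, added up across all columns;
# operator.add never collides with pyspark.sql.functions.sum.
df.withColumn(
    'numNulls',
    reduce(operator.add, [df[c].isNull().cast('int') for c in df.columns])
).show()
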
0
# Create a new DataFrame with only the 'id' column and a 'numNulls' column (the number of nulls in each row).
# To build it, convert the original DataFrame to an RDD, count the Nones in each Row, and convert the result back to a DataFrame.

df2 = df.rdd.map(lambda x: (x[0], x.count(None))).toDF(['id','numNulls'])
df2.show()

+---+--------+
| id|numNulls|
+---+--------+
|  1|       1|
|  2|       2|
|  3|       1|
+---+--------+

# Now join the original DataFrame with the new one on the 'id' column.

df3 = df.join(df2, df.id == df2.id, 'inner').drop(df2.id)
df3.show()

+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    A| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    C|       1|
+---+-----+-----+-----+--------+
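
If you would rather not drop the duplicate id column after the join, the same two steps can be chained using a name-based join, which keeps a single id column. A small sketch, assuming the df from the question:

# Count the Nones in each Row, then join back on the column name 'id'
# so the join key is not duplicated in the result.
nullCounts = df.rdd.map(lambda x: (x[0], x.count(None))).toDF(['id', 'numNulls'])
df.join(nullCounts, 'id', 'inner').show()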