
What is a good method to count all the Null / NaN values in a dataframe when using Koalas?

Or, stated another way:

How might I return a per-column list of total null value counts? I am trying to avoid converting the dataframe to Spark or pandas if possible.

NOTE: .sum() omits null values in Koalas (skipna: boolean, default True, and it can't be changed to False), so running df1.isnull().sum() is out of the question.
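For illustration, a minimal sketch of the skipna behavior described above (the series values here are made up):

import databricks.koalas as ks

ser = ks.Series([1.0, None, 3.0])

# skipna defaults to True and cannot be turned off, so the missing
# value is silently dropped instead of propagating as NaN:
ser.sum()  # -> 4.0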

NumPy was suggested as an alternative, but because the dataframe is in Koalas I observed that .sum() was still omitting the NaN values.

Disclaimer: I get that I can run pandas on Spark, but I understand that is counterproductive resource-wise. I also hesitate to compute the sums from a Spark or pandas dataframe and then convert the result back into Koalas (again wasting resources, in my opinion). I'm working with a dataset that contains 73 columns and 4M rows.

SteveZ
  • Just in case you decide to do it in Spark, this is how it's done [here](https://stackoverflow.com/questions/64147246/pyspark-need-to-show-a-count-of-null-empty-values-per-each-column-in-a-datafram/64157257?noredirect=1#comment113459640_64157257) (see the sketch below) – jayrythium Oct 06 '20 at 12:20
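In case you do take the Spark route, the linked answer boils down to the pattern below; a minimal sketch, assuming sdf is the same data as a PySpark DataFrame:

from pyspark.sql import functions as F

# F.count ignores nulls, so count a marker expression that is
# non-null only where the column value is null:
null_counts = sdf.select([
    F.count(F.when(F.col(c).isNull(), c)).alias(c)
    for c in sdf.columns
])
null_counts.show()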

1 Answer


You can actually use df.isnull(). The reason is that it returns an "array" of booleans indicating whether each value is missing, and that boolean mask contains no nulls of its own. Therefore, if you first call isnull() and then sum(), you get the correct count.

Example:

import databricks.koalas as ks

# Build a small Koalas frame with one missing value in column c2.
# Note: the column names must be passed via the columns keyword;
# the second positional argument of ks.DataFrame is the index.
df = ks.DataFrame([
    [1, 3, 9],
    [2, 3, 7],
    [3, None, 3]
], columns=["c1", "c2", "c3"])

df.isnull().sum()
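For the frame above this yields one count per column: 0 for c1, 1 for c2, and 0 for c3.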
Bram
  • Thanks! I just retested with the same dataset and it's working now. Since my data is the same but the Koalas version is newer, I'm thinking they now support it. Good to know it's more in line with pandas. – SteveZ Mar 24 '21 at 15:43