
I have a huge spark dataframe living in a cluster. The count shows there to be 24 million rows. There are also 900+ columns.

Most of these columns are empty. I'm thinking of dropping the columns that are mostly empty. Or get a list of columns that are not mostly empty.

I'm currently looping over columns:

for col in ALL_COLUMNS[1:]:
    test_df = df.select(col)
    NNcount = test_df.filter(test_df[col].isin(["NULL", "", None]) == False).count()
    # more logic ..

and then selecting the surviving columns afterwards. The problem is that each iteration of this loop takes minutes.

Is there a faster way to drop columns based on null counts? Preferably one that doesn't need a separate full scan per column, and is more elegant than this.

Perhaps the answer is already out there, but I'm failing to find it after some searching. Thanks!

Rocky Li
    [Count number of non-NaN entries in each column of Spark dataframe with Pyspark](https://stackoverflow.com/q/33900726/10465355), [Count the number of missing values in a dataframe Spark](https://stackoverflow.com/q/44413132/10465355) – 10465355 Feb 25 '19 at 19:39
  • Thanks, but I'm not sure if this will be faster, I'll test it out. – Rocky Li Feb 25 '19 at 19:40
  • I have never worked with Spark, but wouldn't `df.info()` do the job? – Finrod Felagund Feb 25 '19 at 19:49
  • @FinrodFelagund Spark doesn't have `df.info()`. It does have `df.describe()`, which is very slow (probably the same mechanism). – Rocky Li Feb 25 '19 at 19:59

1 Answer


If you want to get a list of the column names that are mostly (say, at least 20%) null, try this -

from pyspark.sql import functions as F
na_pct = 0.2
total = df.count()  # total row count, computed once rather than once per column
cols_to_drop = [x for x in df.columns if df.filter(F.col(x).isNull()).count() / total >= na_pct]

However, if you want to remove those columns and keep the rest, then try -

cols_to_keep = [x for x in df.columns if df.filter(F.col(x).isNull()).count() / total < na_pct]
df = df.select(*cols_to_keep)
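
Note that both snippets still launch one Spark job per column, which is the cost the question is trying to avoid. As a minimal sketch (assuming the `F`, `na_pct`, and `total` defined above, and borrowing the single-pass counting idea from the questions linked in the comments), the null count of every column can be collected in one aggregation instead:

# Sketch: one aggregation returns the null count of every column at once.
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).first().asDict()

cols_to_keep = [c for c, n in null_counts.items() if n / total < na_pct]
df = df.select(*cols_to_keep)

If empty strings or the literal string "NULL" should also count as missing (as in the question's own filter), the `F.col(c).isNull()` condition can be extended with `| F.col(c).isin("NULL", "")` for the string columns.
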
Loochie