
I have a huge spark dataframe living in a cluster. The count shows there to be 24 million rows. There are also 900+ columns.

Most of these columns are empty. I'm thinking of dropping the columns that are mostly empty. Or get a list of columns that are not mostly empty.

I'm currently looping over columns:

for col in ALL_COLUMNS[1:]:
    test_df = df.select(col)
    NNcount = test_df.filter(test_df[col].isin(["NULL", "", None]) == False).count()
    # more logic ..

and then selecting the surviving columns afterwards. The problem is that each iteration of this loop takes minutes.

Is there a faster way to drop columns based on null counts? Preferably one that doesn't need a separate full scan per column, and is more elegant than this.

Perhaps the answer is already out there, but I'm failing to find it after some searching. Thanks!

Rocky Li
    [Count number of non-NaN entries in each column of Spark dataframe with Pyspark](https://stackoverflow.com/q/33900726/10465355), [Count the number of missing values in a dataframe Spark](https://stackoverflow.com/q/44413132/10465355) – 10465355 Feb 25 '19 at 19:39
  • Thanks, but I'm not sure if this will be faster, I'll test it out. – Rocky Li Feb 25 '19 at 19:40
  • I have never worked with Spark, but wouldn't `df.info()` do the job? – Finrod Felagund Feb 25 '19 at 19:49
  • @FinrodFelagund Spark doesn't have `df.info()`. It does have `df.describe()`, which is very slow (probably the same mechanism). – Rocky Li Feb 25 '19 at 19:59

1 Answer


If you want to get a list of the column names that are mostly (say, at least 20%) null, try this -

from pyspark.sql import functions as F
na_pct = 0.2
total = df.count()  # total row count, computed once rather than once per column
cols_to_drop = [x for x in df.columns if df.filter(F.col(x).isNull()).count() / total >= na_pct]

However, if you want to remove those columns and keep the rest, then try -

cols_to_keep = [x for x in df.columns if df.filter(F.col(x).isNull()).count() / total < na_pct]
df = df.select(*cols_to_keep)
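
Note that both snippets still launch one Spark job per column, which is the cost the question is trying to avoid. As a minimal sketch (assuming the `F`, `na_pct`, and `total` defined above, and borrowing the single-pass counting idea from the questions linked in the comments), the null count of every column can be collected in one aggregation instead:

# Sketch: one aggregation returns the null count of every column at once.
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).first().asDict()

cols_to_keep = [c for c, n in null_counts.items() if n / total < na_pct]
df = df.select(*cols_to_keep)

If empty strings or the literal string "NULL" should also count as missing (as in the question's own filter), the `F.col(c).isNull()` condition can be extended with `| F.col(c).isin("NULL", "")` for the string columns.
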
Loochie