I have a huge Spark DataFrame living on a cluster. A count shows 24 million rows, and there are 900+ columns.
Most of these columns are empty. I'm thinking of dropping the columns that are mostly empty, or equivalently getting a list of the columns that are not mostly empty.
I'm currently looping over columns:
for col in ALL_COLUMNS[1:]:
    test_df = df.select(col)
    # count values that are not null and not a "NULL"/"" placeholder
    NNcount = test_df.filter(test_df[col].isNotNull() & ~test_df[col].isin("NULL", "")).count()
    # more logic ...
and selecting the surviving columns afterwards. The problem is that each iteration of this loop takes minutes.
Is there a faster way to drop columns based on null counts? Preferably one that doesn't need a separate full pass over the data for every column, and that is obviously more elegant than this.
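What I'm picturing (though I have no idea whether it's actually faster or even correct) is a single aggregation that counts the populated values for every column in one job, e.g. with pyspark.sql.functions count/when; the 5% cutoff below is just a made-up example:

from pyspark.sql import functions as F

# Rough sketch: one job that counts non-null, non-placeholder values for all columns at once,
# instead of a separate count() job per column. Uses df and ALL_COLUMNS from above.
counts = df.select([
    F.count(F.when(F.col(c).isNotNull() & ~F.col(c).isin("NULL", ""), 1)).alias(c)
    for c in ALL_COLUMNS[1:]
]).first()

total = df.count()
threshold = 0.05  # made-up cutoff: keep columns that are at least 5% populated
keep_cols = [c for c in ALL_COLUMNS[1:] if counts[c] / total >= threshold]
slim_df = df.select(keep_cols)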
Perhaps the answer is already out there, but I'm failing to find a match after some searching. Thanks!