I need an efficient way to list and drop unary columns in a Spark DataFrame (I use the PySpark API). I define a unary column as one which has at most one distinct value, where for the purposes of this definition null counts as a value as well. That means that a column with one distinct non-null value in some rows and null in the remaining rows is not a unary column.
Based on the answers to this question I managed to write an efficient way to obtain a list of null columns (which are a subset of my unary columns) and drop them as follows:
# summary("count") computes the non-null count for every column in a single pass;
# the counts come back as strings, hence the comparison with '0'
counts = df.summary("count").collect()[0].asDict()
null_cols = [c for c in counts.keys() if counts[c] == '0']
df2 = df.drop(*null_cols)
Based on my very limited understanding of the inner workings of Spark, this is fast because the method summary operates on the entire DataFrame at once (I have roughly 300 columns in my initial DataFrame). Unfortunately, I cannot find a similar way to deal with the second type of unary columns: ones which have no null values but are lit(something).
What I currently have is this (using the df2 I obtain from the code snippet above):
from pyspark.sql import functions as F

# approximate distinct counts (nulls excluded) for all columns in one aggregation
prox_counts = (df2.agg(*(F.approx_count_distinct(F.col(c)).alias(c)
                         for c in df2.columns))
                  .collect()[0]
                  .asDict())
poss_unarcols = [k for k in prox_counts.keys() if prox_counts[k] < 3]
unar_cols = [c for c in poss_unarcols if df2.select(c).distinct().count() < 2]
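Finally I drop these columns the same way as before (just sketching the last step; df_clean is a placeholder name of mine):

# drop the exact-checked unary columns (df_clean is a placeholder name)
df_clean = df2.drop(*unar_cols)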
Essentially, I first find columns which could be unary in a fast but approximate way and then look at the "candidates" in more detail and more slowly.
What I don't like about it is that a) even with the approximate pre-selection it is still fairly slow, taking over a minute to run even though at this point I only have roughly 70 columns (and about 6 million rows), and b) I use approx_count_distinct with the magical constant 3 (approx_count_distinct does not count null, hence 3 instead of 2). Since I'm not exactly sure how approx_count_distinct works internally, I am a little worried that 3 is not a particularly good constant, since the function might estimate the number of distinct (non-null) values as, say, 5 when it really is 1, so maybe a higher constant is needed to guarantee nothing is missed from the candidate list poss_unarcols.
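One thing I have considered, though I'm not sure it really addresses the accuracy worry, is passing an explicit maximum relative standard deviation to approx_count_distinct to tighten the estimate, roughly like this (the 0.01 is an arbitrary choice on my part):

# tighter (but more expensive) approximate counts; rsd=0.01 is an arbitrary choice
prox_counts = (df2.agg(*(F.approx_count_distinct(F.col(c), rsd=0.01).alias(c)
                         for c in df2.columns))
                  .collect()[0]
                  .asDict())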
Is there a smarter way to do this, ideally so that I don't even have to drop the null columns separately and can do it all in one fell swoop (although that part is actually quite fast and so not that big an issue)?
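For concreteness, the naive single-pass, exact version I have in mind (but which I suspect is too slow with exact distinct counts over all ~300 columns) would look roughly like this; it is only a sketch, and the distinct_/has_null_ aliases and variable names are made up:

from pyspark.sql import functions as F

# one aggregation: exact distinct count of non-null values plus a null-presence flag
aggs = []
for c in df.columns:
    aggs.append(F.countDistinct(F.col(c)).alias("distinct_" + c))
    aggs.append(F.max(F.col(c).isNull().cast("int")).alias("has_null_" + c))

row = df.agg(*aggs).collect()[0].asDict()

# unary = at most one distinct value, counting null as a value
unary_cols = [c for c in df.columns
              if row["distinct_" + c] + row["has_null_" + c] <= 1]
df_final = df.drop(*unary_cols)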