Check if all values of a column are equal in a PySpark DataFrame

Question

I have to get rid of columns that don't add information to my dataset, i.e. columns with the same values in all the entries.

I devised two ways of doing this:

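from pyspark.sql import functions as F

# Approach 1: a column is constant if its minimum equals its maximum
for col in df.columns:
    if df.agg(F.min(col)).collect()[0][0] == df.agg(F.max(col)).collect()[0][0]:
        df = df.drop(col)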

# Approach 2: a column is constant if it has exactly one distinct value
for col in df.columns:
    if df.select(col).distinct().count() == 1:
        df = df.drop(col)

Is there a better, faster, or more straightforward way to do this?

You could try using UDFs, which should improve performance compared with loops — The Singularity, Sep 07 '21 at 10:28
https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html — The Singularity, Sep 09 '21 at 05:03

score 5 · Accepted Answer · answered Sep 07 '21 at 14:14

df = df.drop(*(col for col in df.columns if df.select(col).distinct().count() == 1))

Leon Zajchowski
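
The one-liner above still launches one Spark job per column. As a rough sketch (not part of the original answer), the same check can be folded into a single aggregation that collects the minimum, maximum, and non-null count of every column at once. The helper names `is_constant` and `constant_columns` are made up for illustration. The sketch assumes column names without dots or other special characters and column types that support `min`/`max` (map columns, for example, do not). Nulls are treated as a value of their own, to mirror `distinct().count() == 1`.

```python
def is_constant(min_value, max_value, non_null, total):
    """Return True when a column holds exactly one distinct value.

    A null counts as a value of its own, mirroring distinct().count() == 1.
    """
    if total == 0:
        return False  # empty DataFrame: distinct().count() would be 0
    if non_null == 0:
        return True   # every entry is null
    return non_null == total and min_value == max_value


def constant_columns(df):
    """List the constant columns of a PySpark DataFrame using one aggregation job."""
    # Imported here so is_constant stays usable without a Spark installation.
    from pyspark.sql import functions as F

    exprs = [F.count(F.lit(1))]  # total number of rows
    for c in df.columns:
        # Assumption: plain column names and orderable column types.
        exprs += [F.min(c), F.max(c), F.count(c)]  # count(c) skips nulls
    row = df.agg(*exprs).first()
    total = row[0]
    return [
        c
        for i, c in enumerate(df.columns)
        if is_constant(row[1 + 3 * i], row[2 + 3 * i], row[3 + 3 * i], total)
    ]
```

Usage would then be `df = df.drop(*constant_columns(df))`. Whether this is actually faster depends on the data and the cluster, so it is worth timing against the per-column version.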

score 0 · Answer 2 · answered Apr 13 '22 at 01:17

I prefer to use the subtract method.

df1 = ...  # Sample DataFrame #1
df2 = ...  # Sample DataFrame #2

assert 0 == df1.subtract(df2).count()
assert 0 == df2.subtract(df1).count()

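Another way is to check the union. Note that union() keeps duplicate rows, so the result has to be deduplicated before counting.

combined = df1.union(df2).distinct().count()
assert df1.distinct().count() == combined
assert df2.distinct().count() == combined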
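
One caveat with the subtract check: `subtract` is set-based, so two DataFrames that differ only in how often a row is repeated still pass. If duplicates matter, a possible variant (a sketch, with `same_rows` as a made-up helper name) uses `exceptAll`, which is available since Spark 2.4 and keeps duplicates. It assumes both DataFrames have the same schema.

```python
def same_rows(df1, df2):
    """Return True when both DataFrames hold the same rows with the same multiplicities."""
    # exceptAll is a multiset difference: unlike subtract, it keeps duplicates.
    # limit(1) only asks whether at least one leftover row exists.
    only_in_1 = df1.exceptAll(df2).limit(1).count()
    only_in_2 = df2.exceptAll(df1).limit(1).count()
    return only_in_1 == 0 and only_in_2 == 0
```

It can be used as `assert same_rows(df1, df2)`.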
