I want to drop the columns that hold the same value in every row of a DataFrame. My DataFrame has around 25K columns and 13K rows.
Below is the code I have tried:
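from pyspark.sql.functions import col, countDistinct

# Count the distinct non-null values of every column in a single aggregation.
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]
df.drop(*cols_to_drop).show()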
Running this code fails with an out-of-memory error. This is the error I received:
Py4JJavaError: An error occurred while calling o276142.collectToPython.
: java.lang.OutOfMemoryError: Java heap space
Is there a faster, more memory-efficient way to do this?
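One direction that might help, given only as an untested sketch: a single aggregation over 25K columns builds a very large query plan, which may be what exhausts the heap, so the columns could be checked in smaller batches. The sketch also swaps countDistinct for a min == max comparison, which is true exactly when a column has one distinct non-null value and is usually cheaper to compute. The helper names chunked and constant_columns and the batch size of 500 are my own assumptions, and the comparison assumes every column has an orderable type (min and max do not work on map columns, for example).

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def constant_columns(df, batch_size=500):
    """Return the names of columns holding exactly one distinct non-null value.

    `batch_size` is an arbitrary starting point (an assumption), not a tuned value.
    """
    # Imported here so that chunked() stays usable without a Spark installation.
    from pyspark.sql import functions as F

    constant = []
    for batch in chunked(df.columns, batch_size):
        # min == max is true only when the column has a single distinct non-null
        # value; for an all-null column the comparison is NULL (None in Python),
        # so such a column is kept, matching the countDistinct == 1 check.
        flags = df.agg(
            *[(F.min(c) == F.max(c)).alias(c) for c in batch]
        ).first().asDict()
        constant.extend(c for c in batch if flags[c])
    return constant
```

With this in place, the drop itself would be df = df.drop(*constant_columns(df)). Each batch triggers its own pass over the data, so caching the DataFrame first may be worthwhile if it fits in memory; whether any of this avoids the heap error on the real data is something I have not verified.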