
I want to drop the columns that have the same value in every row of the dataframe. My dataframe has around 25K columns and 13K rows.

Below is the code I have tried:

from pyspark.sql.functions import col, countDistinct

col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()

# Note: using `col` as the loop variable would shadow the imported function, so use `c` instead.
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]

df.drop(*cols_to_drop).show()

While executing the code I'm running into memory issues. Following is the error I received:

Py4JJavaError: An error occurred while calling o276142.collectToPython. : java.lang.OutOfMemoryError: Java heap space

Is there a faster, more memory-efficient way to tackle this problem?

Shyam
  • From my little knowledge, there are methods to drop duplicates within pandas and higher-level modules (PySpark may have them too). Have you at least done a 5-minute search? https://stackoverflow.com/questions/31064243/remove-duplicates-from-a-dataframe-in-pyspark – LoneWanderer Dec 10 '19 at 20:54

0 Answers