
I am new to PySpark.

I have read a parquet file. I only want to keep columns that have at least 10 values.

I have used describe() to get the count of non-null records for each column.

How do I now extract the column names that have fewer than 10 values and then drop those columns before writing to a new file?

df = spark.read.parquet(file)

col_count = df.describe().filter('summary == "count"')

newpyspark
  • Possible duplicate of [How to drop columns based on multiple filters in a dataframe using PySpark?](https://stackoverflow.com/questions/58120774/how-to-drop-columns-based-on-multiple-filters-in-a-dataframe-using-pyspark) – pault Sep 27 '19 at 22:16
  • Admittedly it's not an exact duplicate, but the solution will basically be the same – pault Sep 27 '19 at 22:17

1 Answer


You can convert it into a dictionary and then filter out the keys (column names) based on their values. Note that the count returned by describe() is a StringType(), so it needs to be converted to int in the Python code before comparing with 10:

# here is what you have so far which is a dataframe
col_count = df.describe().filter('summary == "count"')

# exclude the 1st column (`summary`) from the dataframe and save the resulting Row to a dictionary
colCountDict = col_count.select(col_count.columns[1:]).first().asDict()

# find column names (k) whose count int(v) < 10
bad_cols = [k for k, v in colCountDict.items() if int(v) < 10]

# drop bad columns
df_new = df.drop(*bad_cols)
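
To finish the workflow from the question, you can then write the trimmed dataframe to a new parquet file. A minimal sketch (the output path below is just a placeholder, not something from the question):

# hypothetical output path; replace with wherever the new file should live
df_new.write.mode("overwrite").parquet("/path/to/output_parquet")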

Some notes:

  • use @pault's approach if the information cannot be retrieved directly from df.describe(), df.summary(), etc.

  • you need to drop() the bad columns rather than select() the good ones, since describe()/summary() only include numeric and string columns; selecting columns from the list produced by df.describe() would silently lose columns of TimestampType(), ArrayType(), etc. (see the sketch after these notes for a type-agnostic alternative)
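
Following up on that second note: if you need the non-null counts to cover every column type (timestamps, arrays, etc.), one common alternative, sketched here rather than taken from the answer itself, is to aggregate with F.count over all columns instead of relying on describe():

from pyspark.sql import functions as F

# count non-null values in every column, regardless of its data type
counts = df.select([F.count(F.col(c)).alias(c) for c in df.columns]).first().asDict()

# same dropping logic as above, but Timestamp/Array columns are now covered too
bad_cols = [c for c, n in counts.items() if n < 10]
df_new = df.drop(*bad_cols)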

jxc
  • n/p, just a reminder: if you want to also check and remove Date/Timestamp columns, this will not help since df.describe or df.summary won't count those columns. Have a good weekend :) – jxc Sep 28 '19 at 04:02
  • 1
  • thanks @jxc. The parquet file had the attributes as a Row object of structs in a list, so I was able to use groupBy and count to find the bad columns and then filter them out with a where clause. And this solution handles all datatypes. The examples provided were helpful for reaching a generic solution – newpyspark Sep 30 '19 at 21:17