
So I have a PySpark dataframe with 12 rows and 50 columns. I want to drop the columns that contain 0 in more than 4 rows.

However, the answers in the linked question are only for pandas. Is there a solution for a PySpark dataframe?

mojek

2 Answers


In PySpark, you'll have to bring the count of zeros in every column into the driver using collect(). Memory-wise this should not be a big problem, because you will have only one value per column. Try this:

from pyspark.sql import functions as F

# Sample data: 6 rows, 3 columns
tst = sqlContext.createDataFrame([(1, 0, 0), (1, 0, 4), (1, 0, 10), (2, 1, 90), (7, 2, 0), (0, 3, 11)],
                                 schema=['group', 'order', 'value'])
# Count the zeros in each column and collect the counts into the driver
expr = [F.count(F.when(F.col(coln) == 0, 1)).alias(coln) for coln in tst.columns]
tst_cnt = tst.select(*expr).collect()[0].asDict()
# Keep only the columns with at most 2 zeros (adjust the threshold to your case)
sel_coln = [x for x in tst_cnt.keys() if tst_cnt[x] <= 2]
tst_final = tst.select(sel_coln)

I think that in SQL syntax you could do the same thing with a subquery.
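Roughly, the same zero-count logic expressed in SQL could look like this (a sketch, not a full subquery solution; it reuses the tst dataframe from above, registers it as a temp view, and builds the select list in Python so you don't have to type out every column; the backticks guard against reserved words like group and order):

tst.createOrReplaceTempView("tst")
# Build one "sum(case when col = 0 then 1 else 0 end)" expression per column
agg_list = ", ".join(
    "sum(case when `{0}` = 0 then 1 else 0 end) as `{0}`".format(c) for c in tst.columns
)
tst_cnt = sqlContext.sql("select " + agg_list + " from tst").collect()[0].asDict()
sel_coln = [c for c in tst_cnt if tst_cnt[c] <= 2]
tst_final = tst.select(sel_coln)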

Raghu
  • What if I have a large number of columns and I want to filter on the second column and beyond? In my case I have around 100+ columns. The first column is Date and the second column onward would be the parameters. – mojek Jul 13 '20 at 02:37
  • you can manipulate the list comprehension as needed, for example: if you need to skip the first two columns: `expr = [F.count(F.when(F.col(coln)==0,1)).alias(coln) for coln in tst.columns[2:]]`; if you want to skip certain columns like date: `expr = [F.count(F.when(F.col(coln)==0,1)).alias(coln) for coln in tst.columns if 'date' not in coln]` – Raghu Jul 13 '20 at 05:11
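Putting that comment together into a runnable sketch (assuming the first column is the Date column that should always be kept, and using the question's threshold of at most 4 zeros):

# Count zeros only in the parameter columns, i.e. everything after the
# first column, which is assumed to be the Date column
param_cols = tst.columns[1:]
expr = [F.count(F.when(F.col(c) == 0, 1)).alias(c) for c in param_cols]
cnt = tst.select(*expr).collect()[0].asDict()
# Keep the Date column plus every parameter column with at most 4 zeros
sel_coln = [tst.columns[0]] + [c for c in param_cols if cnt[c] <= 4]
tst_final = tst.select(sel_coln)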

You can do the following:

# Create test data. The field called "col5" won't match
# the criterion set in the function "check_number"
df1 = spark.sql("select 1 col1, 4 col2, 0 col3, 1 col4, 0 col5")
df2 = spark.sql("select 2 col1, 9 col2, 5 col3, 7 col4, 0 col5")
df3 = spark.sql("select 3 col1, 2 col2, 6 col3, 5 col4, 0 col5")
df4 = spark.sql("select 4 col1, 7 col2, 7 col3, 3 col4, 1 col5")

df = df1.union(df2).union(df3).union(df4)
df.createOrReplaceTempView("df")

print("Original dataframe")
df.show()

# Change the criterion to filter on whatever you need. In this sample it
# returns True for the columns that have fewer than 2 zeros
def check_number(column_name):
    return spark.sql("select count(" + column_name + ") from df where " + column_name + " = 0").take(1)[0][0] < 2

fields = [x.name for x in df.schema.fields if check_number(x.name)]

print("After filtering dataframe")
df.select(fields).show()

In the function check_number you can put whatever criterion you need (see the sketch after the output below).

The output is

Original dataframe
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|   1|   4|   0|   1|   0|
|   2|   9|   5|   7|   0|
|   3|   2|   6|   5|   0|
|   4|   7|   7|   3|   1|
+----+----+----+----+----+

After filtering dataframe
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|   4|   0|   1|
|   2|   9|   5|   7|
|   3|   2|   6|   5|
|   4|   7|   7|   3|
+----+----+----+----+

As you can see, I am using PySpark in combination with SQL.
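For example, to match the question's original rule of dropping columns that contain 0 in more than 4 rows, only check_number needs to change (a sketch; it still relies on the df temp view registered above):

def check_number(column_name):
    # Keep the column only if it contains at most 4 zeros
    zero_count = spark.sql("select count(*) from df where " + column_name + " = 0").take(1)[0][0]
    return zero_count <= 4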

Oscar Lopez M.
  • What if I have a large number of columns and I want to filter on the second column and beyond? The first column is Date (which I want to keep) and the second column onward are the parameters (which I want to filter). In my case I have around 100+ columns. – mojek Jul 13 '20 at 02:56
  • In that case you just need to create a particular filter on the `df.schema.fields` so that you take only the columns you need. `df.schema.fields` returns all the columns present on the table, but to skip some of them you can create another list comprehension that filters out whatever columns you don't need. The rest of the logic should be the same (apart from that particular change). – Oscar Lopez M. Jul 13 '20 at 11:14
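A sketch of what that filter could look like, assuming the date column is literally named Date (adjust the name to your schema):

date_col = "Date"  # assumption: the column that should always be kept
candidate_cols = [x.name for x in df.schema.fields if x.name != date_col]
fields = [date_col] + [c for c in candidate_cols if check_number(c)]
df.select(fields).show()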