
I have a dataframe with two columns:

+--------+-----+
|    col1| col2|
+--------+-----+
|      22| 12.2|
|       1|  2.1|
|       5| 52.1|
|       2| 62.9|
|      77| 33.3|
+--------+-----+

I would like to create a new dataframe that keeps only the rows where

"value of col1" > "value of col2"

Just as a note, col1 has long type and col2 has double type.
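
For reference, the sample dataframe can be built like this (a minimal sketch; the SparkSession name spark and the explicit schema string are assumptions, not part of the question):

# reproduce the sample data: col1 as long, col2 as double
# (assumes a SparkSession named spark)
df = spark.createDataFrame(
    [(22, 12.2), (1, 2.1), (5, 52.1), (2, 62.9), (77, 33.3)],
    "col1 long, col2 double",
)
df.show()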

The result should look like this:

+--------+----+
|    col1|col2|
+--------+----+
|      22|12.2|
|      77|33.3|
+--------+----+
– LDropl

4 Answers


I think the best way would be to simply use "filter".

df_filtered = df.filter(df.col1 > df.col2)
df_filtered.show()

+--------+----+
|    col1|col2|
+--------+----+
|      22|12.2|
|      77|33.3|
+--------+----+
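
Note that Spark promotes the long column to double for the comparison, so no explicit cast is needed. An equivalent way to write the same filter uses col() from pyspark.sql.functions (a sketch assuming the same DataFrame df):

from pyspark.sql.functions import col

# equivalent filter written with col() instead of attribute access
df_filtered = df.filter(col("col1") > col("col2"))
df_filtered.show()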

Another possible way is to use the where function of the DataFrame.

For example, this (in Scala):

val output = df.where("col1 > col2")

will give you the expected result:

+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|  77|33.3|
+----+----+
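
The same expression-string form works in PySpark as well, since where is an alias for filter and both accept a SQL expression string (a sketch assuming the DataFrame is named df):

# where() is an alias for filter() and accepts a SQL expression string
output = df.where("col1 > col2")
output.show()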
– pheeleeppoo

The best way to keep rows based on a condition is to use filter, as mentioned by others.

To answer the question as stated in the title, one option to remove rows based on a condition is to use a left_anti join in PySpark. For example, to delete all rows with col1 > col2, use:

rows_to_delete = df.filter(df.col1 > df.col2)

df_with_rows_deleted = df.join(rows_to_delete, on=[key_column], how='left_anti')
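
A concrete sketch of that approach: since the question's dataframe has no natural key, a synthetic id column (a hypothetical helper, not part of the original answer) is added first with monotonically_increasing_id:

from pyspark.sql.functions import monotonically_increasing_id

# add a synthetic key column so the left_anti join can match rows
# (hypothetical helper; the original answer assumes an existing key_column)
df_keyed = df.withColumn("id", monotonically_increasing_id())

rows_to_delete = df_keyed.filter(df_keyed.col1 > df_keyed.col2)
df_with_rows_deleted = df_keyed.join(rows_to_delete, on=["id"], how="left_anti")
df_with_rows_deleted.show()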
– MMizani

You can use sqlContext to simplify the challenge.

First register the dataframe as a temp view, for example df.createOrReplaceTempView("tbl1"), then run the SQL query, e.g. sqlContext.sql("select * from tbl1 where col1 > col2").
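
Put together, a minimal PySpark sketch of this approach (assuming a SparkSession named spark, whose sql method plays the role of sqlContext.sql here):

# register the dataframe as a temporary view and query it with SQL
df.createOrReplaceTempView("tbl1")
result = spark.sql("select * from tbl1 where col1 > col2")
result.show()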

– jagath
  • You can use the following to create sqlContext: `from pyspark.sql import SQLContext; sqlContext = SQLContext.getOrCreate(sc).sparkSession` – jagath Sep 18 '18 at 23:52
  • Thank you! If I understood correctly, sqlContext.sql(...) will return a dataframe, right? – LDropl Sep 19 '18 at 00:07