
I have a dataframe with two columns:

+--------+-----+
|    col1| col2|
+--------+-----+
|      22| 12.2|
|       1|  2.1|
|       5| 52.1|
|       2| 62.9|
|      77| 33.3|
+--------+-----+

I would like to create a new dataframe that keeps only the rows where

"value of col1" > "value of col2"

Just as a note, col1 has long type and col2 has double type.
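
For reference, the sample dataframe can be built like this (a minimal sketch; the SparkSession name spark and the explicit schema string are assumptions, not part of the question):

# reproduce the sample data: col1 as long, col2 as double
# (assumes a SparkSession named spark)
df = spark.createDataFrame(
    [(22, 12.2), (1, 2.1), (5, 52.1), (2, 62.9), (77, 33.3)],
    "col1 long, col2 double",
)
df.show()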

The result should look like this:

+--------+----+
|    col1|col2|
+--------+----+
|      22|12.2|
|      77|33.3|
+--------+----+
– LDropl

4 Answers


I think the best way would be to simply use "filter".

df_filtered = df.filter(df.col1 > df.col2)
df_filtered.show()

+--------+----+
|    col1|col2|
+--------+----+
|      22|12.2|
|      77|33.3|
+--------+----+
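
Note that Spark promotes the long column to double for the comparison, so no explicit cast is needed. An equivalent way to write the same filter uses col() from pyspark.sql.functions (a sketch assuming the same DataFrame df):

from pyspark.sql.functions import col

# equivalent filter written with col() instead of attribute access
df_filtered = df.filter(col("col1") > col("col2"))
df_filtered.show()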

Another possible way is to use the where function of the DataFrame.

For example, this (in Scala):

val output = df.where("col1 > col2")

will give you the expected result:

+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|  77|33.3|
+----+----+
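
The same expression-string form works in PySpark as well, since where is an alias for filter and both accept a SQL expression string (a sketch assuming the DataFrame is named df):

# where() is an alias for filter() and accepts a SQL expression string
output = df.where("col1 > col2")
output.show()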
– pheeleeppoo

The best way to keep rows based on a condition is to use filter, as mentioned by others.

To answer the question as stated in the title, one option to remove rows based on a condition is to use a left_anti join in PySpark. For example, to delete all rows with col1 > col2, use:

rows_to_delete = df.filter(df.col1 > df.col2)

df_with_rows_deleted = df.join(rows_to_delete, on=[key_column], how='left_anti')
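
A concrete sketch of that approach: since the question's dataframe has no natural key, a synthetic id column (a hypothetical helper, not part of the original answer) is added first with monotonically_increasing_id:

from pyspark.sql.functions import monotonically_increasing_id

# add a synthetic key column so the left_anti join can match rows
# (hypothetical helper; the original answer assumes an existing key_column)
df_keyed = df.withColumn("id", monotonically_increasing_id())

rows_to_delete = df_keyed.filter(df_keyed.col1 > df_keyed.col2)
df_with_rows_deleted = df_keyed.join(rows_to_delete, on=["id"], how="left_anti")
df_with_rows_deleted.show()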
– MMizani

You can use sqlContext to simplify the challenge.

First register the dataframe as a temp view, for example df.createOrReplaceTempView("tbl1"), then run the SQL query, e.g. sqlContext.sql("select * from tbl1 where col1 > col2").
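
Put together, a minimal PySpark sketch of this approach (assuming a SparkSession named spark, whose sql method plays the role of sqlContext.sql here):

# register the dataframe as a temporary view and query it with SQL
df.createOrReplaceTempView("tbl1")
result = spark.sql("select * from tbl1 where col1 > col2")
result.show()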

– jagath
  • You can use the following to create sqlContext: `from pyspark.sql import SQLContext; sqlContext = SQLContext.getOrCreate(sc).sparkSession` – jagath Sep 18 '18 at 23:52
  • Thank you! If I understood correctly, sqlContext.sql(...) will return a dataframe, right? – LDropl Sep 19 '18 at 00:07