71

I would like to rewrite this from R to PySpark. Any nice-looking suggestions?

array <- c(1, 2, 3)
dataset <- filter(dataset, !(column %in% array))
Babu

8 Answers

151

In PySpark you can do it like this:

array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(*array) == False)

Or using the bitwise NOT operator `~`:

dataframe.filter(~dataframe.column.isin(*array))
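
For reference, a minimal end-to-end sketch of the two forms above (the SparkSession setup, the toy data, and the column name `value` are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data; "value" is an illustrative column name
dataframe = spark.createDataFrame([(1,), (2,), (4,), (5,)], ["value"])

array = [1, 2, 3]
# Keep only the rows whose value is NOT in array
dataframe.filter(dataframe.value.isin(*array) == False).show()
dataframe.filter(~dataframe.value.isin(*array)).show()
# Both print the rows with values 4 and 5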
Ryan Widmaier
  • What is the job of the `*` in `*array`? – Joe May 20 '19 at 12:21
  • `*variable` is Python syntax for unpacking a list so that its elements are passed to the function as individual arguments, in order. – Ryan Widmaier May 20 '19 at 13:04
  • Note that `dataframe.column` is case sensitive! Alternatively, you can use the dictionary syntax `dataframe[column]`, which is not :) – information_interchange Aug 14 '19 at 20:44
  • Should you not use `is False` instead of `== False`? – rjurney Sep 12 '19 at 19:10
  • @rjurney No. What the `==` operator is doing here is calling the overloaded `__eq__` method on the `Column` result returned by `dataframe.column.isin(*array)`. That's overloaded to return another column result to test for equality with the other argument (in this case, `False`). The `is` operator tests for object identity, that is, whether the objects are actually the same place in memory. If you use `is` here, it would always fail because the constant `False` never lives in the same memory location as a `Column`. Additionally, you can't overload `is`. – Josephine Moeller Sep 27 '19 at 21:23
  • @JohnMoeller Thanks, John. I appreciate it. – rjurney Sep 28 '19 at 17:42
  • List splatting with `*` does not make any difference here. You can just use `isin(array)` and it works just fine. – Michał Jabłoński Aug 04 '20 at 10:38
  • @Joe See Michal's comment above - Spark now supports `list` as argument. – flow2k Aug 10 '20 at 03:42
  • In my opinion it would have been a better design if `column.not_in()` or `column.is_not_in()` was implemented. – Joop Feb 25 '22 at 14:39
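
Since the `*array` question comes up repeatedly in the comments above, here is a small sketch (the toy data and the column name `value` are assumptions) showing that the splatted and unsplatted forms are equivalent:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (4,)], ["value"])  # toy data

array = [1, 2, 3]
# Column.isin takes varargs, but a single list/set argument is
# unpacked internally, so both forms build the same predicate.
splatted = df.filter(~df.value.isin(*array))
unsplatted = df.filter(~df.value.isin(array))
assert splatted.collect() == unsplatted.collect()  # both keep only value 4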
52

Use the `~` operator, which negates the condition:

df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))
LaSul
15
df_result = df[df.column_name.isin([1, 2, 3]) == False]
user7438406
4

Slightly different syntax, with a "date" data set:

toGetDates = {'2017-11-09', '2017-11-11', '2017-11-12'}
df = df.filter(df['DATE'].isin(toGetDates) == False)
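
For context, a small self-contained sketch of the above (the toy data is an assumption; the `DATE` column here holds strings in `yyyy-MM-dd` format, whereas with a real `DateType` column you would compare against `datetime.date` values instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data; DATE is kept as strings here
df = spark.createDataFrame(
    [('2017-11-09',), ('2017-11-10',), ('2017-11-12',)], ['DATE'])

toGetDates = {'2017-11-09', '2017-11-11', '2017-11-12'}
# isin also accepts a set; it unpacks a single list/set argument itself
df = df.filter(df['DATE'].isin(toGetDates) == False)
df.show()  # expected: only 2017-11-10 remains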
Grant Shannon
  • 4,709
  • 1
  • 46
  • 36
3

You can also use `.subtract()`.

Example:

df1 = df.filter(df["column"].isin([1, 2, 3]))  # the rows you want to drop
df2 = df.subtract(df1)

This way, df2 contains every row of df that is not in df1. Note that subtract is equivalent to SQL's EXCEPT DISTINCT, so it also removes duplicate rows.
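
A minimal sketch of this approach (the toy data and the column name `value` are assumptions); the duplicate row shows the EXCEPT DISTINCT behavior:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (4,), (4,), (5,)], ["value"])  # toy data

df1 = df.filter(df["value"].isin([1, 2, 3]))  # rows to exclude
df2 = df.subtract(df1)                        # set difference (EXCEPT DISTINCT)
df2.show()  # expected: 4 and 5; the duplicate 4 is collapsed to one row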

user2321864
2

The `*` is not needed. So:

values = [1, 2, 3]
dataframe.filter(~dataframe.column.isin(values))
Johnny M
1

You can also loop over the array and filter:

array = [1, 2, 3]
for i in array:
    df = df.filter(df["column"] != i)
Shadowtrooper
  • I wouldn't recommend this in big data applications... it means you need to go through the whole dataset three times... which is huge if you imagine you have a few terabytes to process – Babu Jun 18 '19 at 14:24
  • No, because Spark internally optimizes this into a single filter that applies all the conditions in one pass. – Shadowtrooper Jun 18 '19 at 17:26
  • Then it should be OK... until a breaking Spark update or a framework switch... and three lines instead of one, plus a hidden optimization, still doesn't look like a good pattern to me... no offense, but I would still recommend avoiding it – Babu Jun 26 '19 at 13:07
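
The optimization debated above is easy to check with `explain()`. In a sketch like the following (toy data and the column name `value` are assumptions), the chained filters show up as a single combined predicate in the physical plan:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (4,)], ["value"])  # toy data

for i in [1, 2, 3]:
    df = df.filter(df["value"] != i)

# Catalyst collapses the chained filters into one predicate, e.g.
# Filter (value != 1 AND value != 2 AND value != 3), in the plan below.
df.explain()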
1

You can also use the SQL functions `F.col` + `.isin()`:

import pyspark.sql.functions as F

array = [1, 2, 3]
df = df.filter(~F.col("column_name").isin(array))

This might be useful if you are already using the SQL functions API elsewhere and want consistency.
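
One place this style pays off is when the column name arrives as a variable, e.g. in a hypothetical helper like the one sketched below (the function name and parameters are made up for illustration):

import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def exclude_values(df: DataFrame, column_name: str, values: list) -> DataFrame:
    """Hypothetical helper: drop rows whose column value is in `values`."""
    return df.filter(~F.col(column_name).isin(values))

# Usage sketch:
# df = exclude_values(df, "status", ["deleted", "archived"])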

yvanscher