71

I would like to rewrite this from R to PySpark. Any nice-looking suggestions?

array <- c(1, 2, 3)
dataset <- filter(dataset, !(column %in% array))
Babu

8 Answers

151

In PySpark you can do it like this:

array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(*array) == False)

Or using the bitwise NOT operator `~`:

dataframe.filter(~dataframe.column.isin(*array))
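
For reference, a minimal end-to-end sketch of the two forms above (the SparkSession setup, the toy data, and the column name `value` are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data; "value" is an illustrative column name
dataframe = spark.createDataFrame([(1,), (2,), (4,), (5,)], ["value"])

array = [1, 2, 3]
# Keep only the rows whose value is NOT in array
dataframe.filter(dataframe.value.isin(*array) == False).show()
dataframe.filter(~dataframe.value.isin(*array)).show()
# Both print the rows with values 4 and 5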
Ryan Widmaier
  • What is the job of the `*` in `*array`? – Joe May 20 '19 at 12:21
  • `*variable` is Python syntax for unpacking a list so that its elements are passed to the function as individual arguments, in order. – Ryan Widmaier May 20 '19 at 13:04
  • Note that `dataframe.column` is case sensitive! Alternatively, you can use the dictionary syntax `dataframe[column]`, which is not :) – information_interchange Aug 14 '19 at 20:44
  • Should you not use `is False` instead of `== False`? – rjurney Sep 12 '19 at 19:10
  • @rjurney No. What the `==` operator is doing here is calling the overloaded `__eq__` method on the `Column` result returned by `dataframe.column.isin(*array)`. That's overloaded to return another column result to test for equality with the other argument (in this case, `False`). The `is` operator tests for object identity, that is, whether the objects are actually the same place in memory. If you use `is` here, it would always fail because the constant `False` never lives in the same memory location as a `Column`. Additionally, you can't overload `is`. – Josephine Moeller Sep 27 '19 at 21:23
  • @JohnMoeller Thanks, John. I appreciate it. – rjurney Sep 28 '19 at 17:42
  • List splatting with `*` does not make any difference here. You can just use `isin(array)` and it works just fine. – Michał Jabłoński Aug 04 '20 at 10:38
  • @Joe See Michal's comment above - Spark now supports `list` as argument. – flow2k Aug 10 '20 at 03:42
  • In my opinion it would have been a better design if `column.not_in()` or `column.is_not_in()` was implemented. – Joop Feb 25 '22 at 14:39
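
Since the `*array` question comes up repeatedly in the comments above, here is a small sketch (the toy data and the column name `value` are assumptions) showing that the splatted and unsplatted forms are equivalent:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (4,)], ["value"])  # toy data

array = [1, 2, 3]
# Column.isin takes varargs, but a single list/set argument is
# unpacked internally, so both forms build the same predicate.
splatted = df.filter(~df.value.isin(*array))
unsplatted = df.filter(~df.value.isin(array))
assert splatted.collect() == unsplatted.collect()  # both keep only value 4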
52

Use the `~` operator, which negates the condition:

df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))
LaSul
15
df_result = df[df.column_name.isin([1, 2, 3]) == False]
user7438406
4

Slightly different syntax, with a "date" data set:

toGetDates = {'2017-11-09', '2017-11-11', '2017-11-12'}
df = df.filter(df['DATE'].isin(toGetDates) == False)
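
For context, a small self-contained sketch of the above (the toy data is an assumption; the `DATE` column here holds strings in `yyyy-MM-dd` format, whereas with a real `DateType` column you would compare against `datetime.date` values instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data; DATE is kept as strings here
df = spark.createDataFrame(
    [('2017-11-09',), ('2017-11-10',), ('2017-11-12',)], ['DATE'])

toGetDates = {'2017-11-09', '2017-11-11', '2017-11-12'}
# isin also accepts a set; it unpacks a single list/set argument itself
df = df.filter(df['DATE'].isin(toGetDates) == False)
df.show()  # expected: only 2017-11-10 remains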
Grant Shannon
  • 4,709
  • 1
  • 46
  • 36
3

You can also use `.subtract()`.

Example:

df1 = df.filter(df["column"].isin([1, 2, 3]))  # the rows you want to drop
df2 = df.subtract(df1)

This way, df2 contains every row of df that is not in df1. Note that subtract is equivalent to SQL's EXCEPT DISTINCT, so it also removes duplicate rows.
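
A minimal sketch of this approach (the toy data and the column name `value` are assumptions); the duplicate row shows the EXCEPT DISTINCT behavior:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (4,), (4,), (5,)], ["value"])  # toy data

df1 = df.filter(df["value"].isin([1, 2, 3]))  # rows to exclude
df2 = df.subtract(df1)                        # set difference (EXCEPT DISTINCT)
df2.show()  # expected: 4 and 5; the duplicate 4 is collapsed to one row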

user2321864
2

The `*` is not needed. So:

values = [1, 2, 3]
dataframe.filter(~dataframe.column.isin(values))
Johnny M
1

You can also loop over the array and filter:

array = [1, 2, 3]
for i in array:
    df = df.filter(df["column"] != i)
Shadowtrooper
  • I wouldn't recommend this in big data applications... it means you need to go through the whole dataset three times... which is huge if you imagine you have a few terabytes to process – Babu Jun 18 '19 at 14:24
  • No, because Spark internally optimizes this into a single filter that applies all the conditions in one pass. – Shadowtrooper Jun 18 '19 at 17:26
  • Then it should be OK... until a breaking Spark update or a framework switch... and three lines instead of one, plus a hidden optimization, still doesn't look like a good pattern to me... no offense, but I would still recommend avoiding it – Babu Jun 26 '19 at 13:07
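
The optimization debated above is easy to check with `explain()`. In a sketch like the following (toy data and the column name `value` are assumptions), the chained filters show up as a single combined predicate in the physical plan:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (4,)], ["value"])  # toy data

for i in [1, 2, 3]:
    df = df.filter(df["value"] != i)

# Catalyst collapses the chained filters into one predicate, e.g.
# Filter (value != 1 AND value != 2 AND value != 3), in the plan below.
df.explain()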
1

You can also use the SQL functions `F.col` + `.isin()`:

import pyspark.sql.functions as F

array = [1, 2, 3]
df = df.filter(~F.col("column_name").isin(array))

This might be useful if you are already using the SQL functions API elsewhere and want consistency.
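
One place this style pays off is when the column name arrives as a variable, e.g. in a hypothetical helper like the one sketched below (the function name and parameters are made up for illustration):

import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def exclude_values(df: DataFrame, column_name: str, values: list) -> DataFrame:
    """Hypothetical helper: drop rows whose column value is in `values`."""
    return df.filter(~F.col(column_name).isin(values))

# Usage sketch:
# df = exclude_values(df, "status", ["deleted", "archived"])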

yvanscher