
I am trying to get all rows within a DataFrame where a column's value is not within a list (i.e. filtering by exclusion).

As an example:

df = sqlContext.createDataFrame(
    [('1', 'a'), ('2', 'b'), ('3', 'b'), ('4', 'c'), ('5', 'd')],
    schema=('id', 'bar'))

I get the DataFrame:

+---+---+
| id|bar|
+---+---+
|  1|  a|
|  2|  b|
|  3|  b|
|  4|  c|
|  5|  d|
+---+---+

I want to exclude only the rows where bar is 'a' or 'b'.

Using an SQL expression string it would be:

df.filter('bar not in ("a","b")').show()

Is there a way of doing it without using the string for the SQL expression, or excluding one item at a time?

Edit:

I am likely to have a list, ['a','b'], of the excluded values that I would like to use.

gabrown86

4 Answers


It looks like the ~ gives the functionality that I need, but I have yet to find any appropriate documentation on it.

from pyspark.sql.functions import col

df.filter(~col('bar').isin(['a', 'b'])).show()



+---+---+
| id|bar|
+---+---+
|  4|  c|
|  5|  d|
+---+---+
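Since the edit to the question mentions keeping the excluded values in a list, here is a minimal sketch (assuming the same df as above) that passes the list straight to isin:

from pyspark.sql.functions import col

exclude = ['a', 'b']  # the values to filter out
df.filter(~col('bar').isin(exclude)).show()

isin accepts either a single list or individual values as varargs, so no unpacking is needed.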
gabrown86
    It is a default python operator, in folk language called the `not` operator. In the documentation it is called the `unary bitwise invert operator`: http://docs.python.org/reference/expressions.html#unary-arithmetic-and-bitwise-operations – Laurens Koppenol Jul 26 '18 at 06:41
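To make that comment concrete, here is a minimal plain-Python sketch (no Spark involved; the Flag class is made up for illustration) of the hook that ~ invokes:

class Flag:
    def __init__(self, value):
        self.value = value

    def __invert__(self):
        # `~x` calls x.__invert__(); pyspark's Column overrides this
        # hook to build a NOT expression instead of doing bitwise math
        return Flag(not self.value)

print((~Flag(True)).value)  # False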

It could also be done like this:

from pyspark.sql.functions import col

df.filter(col('bar').isin(['a', 'b']) == False).show()
Alezis
    is there any difference between this and using the invert operator (`~`)? – Munick Nov 02 '17 at 04:36
    I don't think so – Alezis Nov 02 '17 at 12:40
    In this case I think there is no real difference since `isin` returns either `True` or `False`. However, in general, comparison like `x == False` is discouraged and might yield surprising results. – Dror Feb 05 '19 at 07:20
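For anyone who wants to check this themselves, one way (a sketch, assuming the df from the question) is to compare the plans Spark generates for the two spellings; if the optimizer treats them alike, the physical plans should match:

from pyspark.sql.functions import col

# Print the query plan for each spelling and compare by eye
df.filter(~col('bar').isin(['a', 'b'])).explain()
df.filter(col('bar').isin(['a', 'b']) == False).explain()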

Here's a gotcha for those with their headspace in Pandas who are moving to PySpark.

import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

spark_conf = SparkConf().setMaster("local").setAppName("MyAppName")
sc = SparkContext(conf=spark_conf)
sqlContext = SQLContext(sc)

records = [
    {"colour": "red"},
    {"colour": "blue"},
    {"colour": None},
]

pandas_df = pd.DataFrame.from_dict(records)
pyspark_df = sqlContext.createDataFrame(records)

So if we wanted the rows that are not red:

pandas_df[~pandas_df["colour"].isin(["red"])]

As expected, Pandas returns both the blue row and the None row.

Looking good. Now the same in our PySpark DataFrame:

pyspark_df.filter(~pyspark_df["colour"].isin(["red"])).collect()

Not what I expected: only the blue row comes back; the row where colour is null is silently dropped.

So after some digging, I found this: https://issues.apache.org/jira/browse/SPARK-20617. In SQL's three-valued logic, NULL IN (...) evaluates to NULL, negating it is still NULL, and filter() keeps only rows where the predicate is true, so the null row vanishes. To include nothingness in our results:

pyspark_df.filter(~pyspark_df["colour"].isin(["red"]) | pyspark_df["colour"].isNull()).show()

much ado about nothing
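Another way to express the same null-safe exclusion, a sketch using coalesce from pyspark.sql.functions: treat the null that isin() yields for a null colour as "keep the row":

from pyspark.sql.functions import coalesce, lit

# ~isin() evaluates to null when colour is null; coalesce maps that null to True
pyspark_df.filter(coalesce(~pyspark_df["colour"].isin(["red"]), lit(True))).show()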

Ryan Collingwood
df.filter((df.bar != 'a') & (df.bar != 'b'))
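Note that this spelling also drops rows where bar is null, since df.bar != 'a' evaluates to null there. If the excluded values live in a list, a sketch generalising the same idea by folding the inequalities together (exclude is an assumed name):

from functools import reduce
from operator import and_

exclude = ['a', 'b']

# AND together one inequality per excluded value
df.filter(reduce(and_, [df.bar != v for v in exclude])).show()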
Assaf Mendelson