54

I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')

where a is the tuple (1, 2, 3). I am getting this error:

java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found

which is basically saying it was expecting something like '(1, 2, 3)' instead of a. The problem is that I can't manually write the values in a, as they are extracted from another job.

How would I filter in this case?

zero323
mar tin

6 Answers

76

The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable, you'll have to do it explicitly using string formatting:

df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2

Obviously this is not something you would use in a "real" SQL environment due to security considerations, but it shouldn't matter here.
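Applied to the question's setup, where a is a tuple produced by another job, a rough sketch of the same idea (assuming my_df is already registered as a temp table) would be:

a = (1, 2, 3)  # tuple obtained from another job
query = "SELECT * FROM my_df WHERE field1 IN {0}".format(a)
df = sqlc.sql(query)
# caveat: a one-element tuple formats as "(1,)", which is not valid SQL and would need special handling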

In practice DataFrame DSL is a much better choice when you want to create dynamic queries:

from pyspark.sql.functions import col

df.where(col("v").isin({"foo", "bar"})).count()
## 2

It is easy to build and compose, and it handles all the details of HiveQL / Spark SQL for you.
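A rough sketch of what "easy to build and compose" can look like in practice (the dictionary and names below are made up for illustration): Column expressions are plain objects, so you can keep them around and apply them to any DataFrame that has matching columns:

from pyspark.sql.functions import col

# hypothetical registry of reusable predicates
predicates = {
    "v_in_foo_bar": col("v").isin({"foo", "bar"}),
    "k_positive": col("k") > 0,
}

df.where(predicates["v_in_foo_bar"]).count()
## 2
df.where(predicates["v_in_foo_bar"] & predicates["k_positive"]).count()
## 2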

Ani Menon
zero323
  • For the second method, you can achieve the same by doing df.where(df.v.isin({"foo", "bar"})).count() – mar tin Mar 09 '16 at 10:47
  • 3
    You can, but personally I don't like this approach. With `col` I can easily decouple SQL expression and particular `DataFrame` object. So you can for example keep a dictionary of useful expressions and just pick them when you need. With explicit DF object you'll have to put it inside a function and it doesn't compose that well. – zero323 Mar 09 '16 at 11:32
  • How can this be done with a list of tuples? If I have e.g. `[(1,1), (1,2), (1,3)]` where one is `aid` and the other is `bid` for example. It would have to be something like `col(['aid', 'bid']).isin([(1,1), (1,2)])` – Stefan Falk Mar 26 '18 at 09:07
36

Reiterating what @zero323 has mentioned above: we can do the same thing using a list as well (not only a set), like below

from pyspark.sql.functions import col

df.where(col("v").isin(["foo", "bar"])).count()
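
For what it's worth, isin also accepts the values as separate arguments, so the following should be equivalent:

df.where(col("v").isin("foo", "bar")).count()
## 2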
zero323
braj
11

Just a little addition/update:

choice_list = ["foo", "bar", "jack", "joan"]

If you want to filter your dataframe "df" so that you keep only the rows where column "v" takes a value from choice_list, then:

from pyspark.sql.functions import col

df_filtered = df.where(col("v").isin(choice_list))
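
Conversely, keeping only the rows whose "v" is not in choice_list is just a negation with ~ (a small sketch using the same names):

df_excluded = df.where(~col("v").isin(choice_list))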
Marioanzas
shwetabharti
6

You can also do this for integer columns:

df_filtered = df.filter("field1 in (1,2,3)")

or this for string columns:

df_filtered = df.filter("field1 in ('a','b','c')")
BICube
0

A slightly different approach that worked for me is to filter with a custom filter function.

from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

def filter_func(a):
    """Wrapper to pass the broadcast variable a into a UDF."""
    def filter_func_(value):
        """Return True if the value is in the broadcast collection."""
        if value in a.value:
            return True
        return False
    return udf(filter_func_, BooleanType())

# Broadcasting allows passing large variables efficiently
a = sc.broadcast((1, 2, 3))
df = my_df.filter(filter_func(a)(col('field1')))
Alex_Gidiotis
0
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Practise').getOrCreate()
df_pyspark = spark.read.csv('datasets/myData.csv', header=True, inferSchema=True)
df_pyspark.createOrReplaceTempView("df")  # we need to create a temp view first
spark.sql("SELECT * FROM df WHERE Departments IN ('IOT', 'Big Data') ORDER BY Departments").show()
user2662006