
How do I bind variables in Apache Spark SQL? For example:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("SELECT * FROM src WHERE col1 = ${VAL1}").collect().foreach(println)
mrsrinivas
user3769729

3 Answers


Spark SQL (as of 1.6 release) does not support bind variables.

Update: Spark 3.4 will have "parametrized SQL queries" https://github.com/apache/spark/pull/38864

  spark.sql(
    sqlText = "SELECT * FROM tbl WHERE date > :startDate LIMIT :maxRows",
    args = Map(
      "startDate" -> "DATE'2022-12-01'",
      "maxRows" -> "100"))

and similarly in the DataFrame API:

def sql(sqlText: String, args: Map[String, String]): DataFrame

PS: What Ashrith is suggesting is not a bind variable. You're constructing a string every time, so every time Spark will parse the query, create an execution plan, etc. The purpose of bind variables (in RDBMS systems, for example) is to cut the time spent creating the execution plan (which can be costly when there are a lot of joins, etc.). Spark would need a special API to "parse" a query and then to "bind" variables into it. Spark does not have this functionality (as of today, the Spark 1.6 release).
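
To make the distinction concrete, here is a minimal sketch of that string-construction workaround on Spark 1.x (reusing the question's sqlContext and src table; the val1 value is hypothetical). Because the value is baked into the SQL text, Spark re-parses and re-plans the statement on every call, which is exactly why it is not a bind variable:

// The value is interpolated into the SQL string before Spark ever sees it,
// so each distinct value produces a new query to parse and plan.
val val1 = "some_value"
sqlContext.sql(s"SELECT * FROM src WHERE col1 = '$val1'").collect().foreach(println)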

Update 8/2018: as of Spark 2.3 there are (still) no bind variables in Spark.

Tagar

I verified this in both the Spark 2.x shell and Thrift (beeline). I was able to bind a variable in a Spark SQL query with the set command.

Query without bind variable:

select count(1) from mytable; 

Query with bind variable (parameterized):

1. Spark SQL shell

 set key_tbl=mytable; -- setting mytable to key_tbl to use as ${key_tbl}
 select count(1) from ${key_tbl};

2. Spark shell

spark.sql("set key_tbl=mytable")
spark.sql("select count(1) from ${key_tbl}").collect()

Both with and without bind params, the query returns an identical result.

Note: Don't put any quotes around the value of the key, since it's a table name here.

Let me know if there are any questions.
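
As a further sketch (assuming Spark 2.x with spark.sql.variable.substitute left at its default of true; the table and column names are hypothetical), the same set-based substitution also works for literal values, provided you add the quotes yourself in the SQL text:

// Substitute both a table name and a string literal via SQL conf variables.
spark.sql("set key_tbl=mytable")
spark.sql("set key_status=active")
// Quotes go around the placeholder for string values, not inside the value itself.
spark.sql("select count(1) from ${key_tbl} where status = '${key_status}'").show()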

mrsrinivas
  • With val slqDF = spark.sql("select count(*) from Prd_IN_GeneralEvents where pdid like $'{id}'") I get the same error. – earl Jul 13 '17 at 15:46
  • Those are NOT bind variables. – Tagar Apr 29 '19 at 15:04
  • @Tagar could you please explain why these are not bind variables? Because f-strings don't seem the way to go – neverMind Oct 04 '20 at 00:35
  • @neverMind please see my separate answer above for why these are not bind variables. Bind variables have to be supported by the backend (Spark in this case); it's an industry term specific to RDBMS systems. Bind variables allow you to create a prepared version of an execution plan in the backend and execute the same prepared statement multiple times (perhaps with different values for the bind variables) without re-analyzing the query or re-creating the execution plan (which can be costly). This is not supported by Spark. Just substituting a literal value into SQL text doesn't make it a bind variable. – Tagar Oct 04 '20 at 03:21

PySpark

for row in sqlContext.sql("SELECT * FROM src WHERE col1 = {0} AND col2 = {1}".format(VAL1, VAL2)).collect():
    print(row)
Vijay Krishna
  • This is how you introduce SQL injection vulnerabilities into your applications. It is almost always wrong to insert variables into an SQL statement without properly escaping them. – H. Rittich Dec 16 '21 at 10:53
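
To illustrate that comment, here is a minimal sketch (in Scala, mirroring the question's setup; the src table and the value are hypothetical) of how an unescaped value can change the meaning of the query:

// A value taken from user input, pasted into the SQL text without escaping.
val val1 = "1 OR 1=1"
// The resulting statement is "SELECT * FROM src WHERE col1 = 1 OR 1=1",
// which bypasses the intended filter and returns every row.
sqlContext.sql(s"SELECT * FROM src WHERE col1 = $val1").collect().foreach(println)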