
How do I bind variables in Apache Spark SQL? For example:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("SELECT * FROM src WHERE col1 = ${VAL1}").collect().foreach(println)
mrsrinivas
user3769729

3 Answers


Spark SQL (as of 1.6 release) does not support bind variables.

Update: Spark 3.4 will have "parametrized SQL queries" https://github.com/apache/spark/pull/38864

  spark.sql(
    sqlText = "SELECT * FROM tbl WHERE date > :startDate LIMIT :maxRows",
    args = Map(
      "startDate" -> "DATE'2022-12-01'",
      "maxRows" -> "100"))

and similarly in the DataFrame API:

def sql(sqlText: String, args: Map[String, String]): DataFrame

PS: What Ashrith is suggesting is not a bind variable. You're constructing a string every time, so every time Spark will parse the query, create an execution plan, etc. The purpose of bind variables (in RDBMS systems, for example) is to cut the time spent creating the execution plan (which can be costly when there are a lot of joins, etc.). Spark would need a special API to "parse" a query and then to "bind" variables into it. Spark does not have this functionality (as of today, the Spark 1.6 release).
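
To make the distinction concrete, here is a minimal sketch of that string-construction workaround on Spark 1.x (reusing the question's sqlContext and src table; the val1 value is hypothetical). Because the value is baked into the SQL text, Spark re-parses and re-plans the statement on every call, which is exactly why it is not a bind variable:

// The value is interpolated into the SQL string before Spark ever sees it,
// so each distinct value produces a new query to parse and plan.
val val1 = "some_value"
sqlContext.sql(s"SELECT * FROM src WHERE col1 = '$val1'").collect().foreach(println)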

Update 8/2018: as of Spark 2.3 there are (still) no bind variables in Spark.

Tagar

I verified this in both the Spark 2.x shell and Thrift (beeline). I was able to bind a variable in a Spark SQL query with the set command.

Query without bind variable:

select count(1) from mytable; 

Query with bind variable (parameterized):

1. Spark SQL shell

 set key_tbl=mytable; -- setting mytable to key_tbl to use as ${key_tbl}
 select count(1) from ${key_tbl};

2. Spark shell

spark.sql("set key_tbl=mytable")
spark.sql("select count(1) from ${key_tbl}").collect()

Both with and without bind params, the query returns an identical result.

Note: Don't put any quotes around the value of the key, since it's a table name here.

Let me know if there are any questions.
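
As a further sketch (assuming Spark 2.x with spark.sql.variable.substitute left at its default of true; the table and column names are hypothetical), the same set-based substitution also works for literal values, provided you add the quotes yourself in the SQL text:

// Substitute both a table name and a string literal via SQL conf variables.
spark.sql("set key_tbl=mytable")
spark.sql("set key_status=active")
// Quotes go around the placeholder for string values, not inside the value itself.
spark.sql("select count(1) from ${key_tbl} where status = '${key_status}'").show()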

mrsrinivas
  • With val slqDF = spark.sql("select count(*) from Prd_IN_GeneralEvents where pdid like $'{id}'") I get the same error. – earl Jul 13 '17 at 15:46
  • Those are NOT bind variables. – Tagar Apr 29 '19 at 15:04
  • @Tagar could you please explain why these are not bind variables? Because f-strings don't seem the way to go – neverMind Oct 04 '20 at 00:35
  • @neverMind please see my separate answer above for why these are not bind variables. Bind variables have to be supported by the backend (Spark in this case); it's an industry term specific to RDBMS systems. Bind variables allow you to create a prepared version of an execution plan in the backend and execute the same prepared statement multiple times (perhaps with different values for the bind variables) without re-analyzing the query or re-creating the execution plan (which can be costly). This is not supported by Spark. Just substituting a literal value into SQL text doesn't make it a bind variable. – Tagar Oct 04 '20 at 03:21

PySpark

for row in sqlContext.sql("SELECT * FROM src WHERE col1 = {0} AND col2 = {1}".format(VAL1, VAL2)).collect():
    print(row)
Vijay Krishna
  • This is how you introduce SQL injection vulnerabilities into your applications. It is almost always wrong to insert variables into an SQL statement without properly escaping them. – H. Rittich Dec 16 '21 at 10:53
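
To illustrate that comment, here is a minimal sketch (in Scala, mirroring the question's setup; the src table and the value are hypothetical) of how an unescaped value can change the meaning of the query:

// A value taken from user input, pasted into the SQL text without escaping.
val val1 = "1 OR 1=1"
// The resulting statement is "SELECT * FROM src WHERE col1 = 1 OR 1=1",
// which bypasses the intended filter and returns every row.
sqlContext.sql(s"SELECT * FROM src WHERE col1 = $val1").collect().foreach(println)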