
I'm using Spark 1.6 and running into the issue above when I run the following code:

// Imports
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import scala.concurrent.ExecutionContext.Implicits.global
import java.util.Properties
import scala.concurrent.Future

// Set up spark on local with 2 threads
val conf = new SparkConf().setMaster("local[2]").setAppName("app")
val sc = new SparkContext(conf)
val sqlCtx = new HiveContext(sc)

// Create fake dataframe
import sqlCtx.implicits._
var df = sc.parallelize(1 to 50000).map { i => (i, i, i, i, i, i, i) }.toDF("a", "b", "c", "d", "e", "f", "g").repartition(2)
// Write it as a parquet file
df.write.parquet("/tmp/parquet1")
df = sqlCtx.read.parquet("/tmp/parquet1")

// JDBC connection
val url = s"jdbc:postgresql://localhost:5432/tempdb"
val prop = new Properties()
prop.setProperty("user", "admin")
prop.setProperty("password", "")

// 4 futures - at least one of them has been consistently failing
val x1 = Future { df.write.jdbc(url, "temp1", prop) }
val x2 = Future { df.write.jdbc(url, "temp2", prop) }
val x3 = Future { df.write.jdbc(url, "temp3", prop) }
val x4 = Future { df.write.jdbc(url, "temp4", prop) }
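
A quick way to surface the exception in the spark-shell is to block on the futures (a sketch, using the x1..x4 values above):

import scala.concurrent.Await
import scala.concurrent.duration._

// Wait for all four writes to finish, then print each outcome;
// the failing write shows up as a Failure wrapping the exception below.
Seq(x1, x2, x3, x4).foreach(f => Await.ready(f, 10.minutes))
Seq(x1, x2, x3, x4).foreach(f => println(f.value))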

Here's the github gist: https://gist.github.com/karanveerm/27d852bf311e39f05491

The error I get is:

        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) ~[org.apache.spark.spark-sql_2.11-1.6.0.jar:1.6.0]
        at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2125) ~[org.apache.spark.spark-sql_2.11-1.6.0.jar:1.6.0]
        at org.apache.spark.sql.DataFrame.foreachPartition(DataFrame.scala:1482) ~[org.apache.spark.spark-sql_2.11-1.6.0.jar:1.6.0]
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.saveTable(JdbcUtils.scala:247) ~[org.apache.spark.spark-sql_2.11-1.6.0.jar:1.6.0]
        at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:306) ~[org.apache.spark.spark-sql_2.11-1.6.0.jar:1.6.0]
        at writer.SQLWriter$.writeDf(Writer.scala:75) ~[temple.temple-1.0-sans-externalized.jar:na]
        at writer.Writer$.writeDf(Writer.scala:33) ~[temple.temple-1.0-sans-externalized.jar:na]
        at controllers.Api$$anonfun$downloadTable$1$$anonfun$apply$25.apply(Api.scala:460) ~[temple.temple-1.0-sans-externalized.jar:2.4.6]
        at controllers.Api$$anonfun$downloadTable$1$$anonfun$apply$25.apply(Api.scala:452) ~[temple.temple-1.0-sans-externalized.jar:2.4.6]
        at scala.util.Success$$anonfun$map$1.apply(Try.scala:237) ~[org.scala-lang.scala-library-2.11.7.jar:na]

Is this a Spark bug, or am I doing something wrong? Are there any workarounds?

sparknoob

3 Answers


After trying several things, I found that one of the threads created by the global ForkJoinPool gets its spark.sql.execution.id property set to a random value. I could not identify the process that actually set it, but I could work around it by using my own ExecutionContext.

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// A fixed pool creates its threads up front, before any SQL execution is in flight.
val executorService = Executors.newFixedThreadPool(4)
implicit val ec = ExecutionContext.fromExecutorService(executorService)
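
With this implicit in scope (instead of ExecutionContext.Implicits.global), the writes run on threads the fixed pool created up front. A sketch, assuming the df, url and prop from the question:

import scala.concurrent.Future

// Scheduled on ec (the fixed pool above), not the global ForkJoinPool,
// so no worker thread inherits a stale spark.sql.execution.id.
val w1 = Future { df.write.jdbc(url, "temp1", prop) }
val w2 = Future { df.write.jdbc(url, "temp2", prop) }
val w3 = Future { df.write.jdbc(url, "temp3", prop) }
val w4 = Future { df.write.jdbc(url, "temp4", prop) }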

I used code from http://danielwestheide.com/blog/2013/01/16/the-neophytes-guide-to-scala-part-9-promises-and-futures-in-practice.html. My guess is that the ForkJoinPool clones thread attributes when it creates new threads, so if that happens while a SQL execution is in progress the new thread inherits the non-null execution id, whereas a FixedThreadPool creates its threads at instantiation.
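
If you want to verify this, the execution id is stored as a thread-local property on the SparkContext, so you can probe what a pool thread has inherited. A small diagnostic sketch (assumes the sc from the question and the global ExecutionContext import):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Prints the execution id (or null) inherited by the pool thread;
// "spark.sql.execution.id" is the key used by SQLExecution in Spark 1.6.
Future {
  println(Thread.currentThread.getName + ": " +
    sc.getLocalProperty("spark.sql.execution.id"))
}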

Knshiro
  • I've run into the same issue. But this solution doesn't seem to help. I still see the `spark.sql.execution.id already set` error. – mottosan Mar 17 '16 at 14:26
  • 1
    @smas the problem is not in the number of threads but in the initialization of those threads. The fork join pool will initialize threads at random times and to initialize new threads it clones all attributes. So if at the time of initialization of a new thread the existing thread has a SQL execution id set it will copy it to the new one instead of letting a new one be generated. – Knshiro Oct 22 '17 at 00:43

Please check SPARK-13747.

Consider using Spark 2.2.0 or higher if that is applicable in your environment.
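
For reference, a sketch of the same four parallel writes on the Spark 2.x SparkSession API (table names and JDBC settings taken from the question):

import java.util.Properties
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("app").getOrCreate()
import spark.implicits._

val df = spark.sparkContext.parallelize(1 to 50000)
  .map(i => (i, i, i, i, i, i, i))
  .toDF("a", "b", "c", "d", "e", "f", "g")
  .repartition(2)

val url = "jdbc:postgresql://localhost:5432/tempdb"
val prop = new Properties()
prop.setProperty("user", "admin")
prop.setProperty("password", "")

// On 2.2.0+ (where SPARK-13747 is resolved) these concurrent writes
// no longer trip over an inherited execution id.
val writes = (1 to 4).map(i => Future { df.write.jdbc(url, s"temp$i", prop) })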

morsik

Test 1: Does it help if you run each of the df.write operations serially instead of in parallel futures?

Test 2: Does it help if you persist the DataFrame, run all the df.write operations in parallel, and unpersist only after they have all completed?
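
A sketch of both tests, assuming the df, url and prop from the question:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Test 1: serial writes - no concurrency, so no clashing execution ids.
Seq("temp1", "temp2", "temp3", "temp4").foreach(t => df.write.jdbc(url, t, prop))

// Test 2: persist once, write in parallel, unpersist after all writes complete.
df.persist()
val writes = Seq("temp1", "temp2", "temp3", "temp4")
  .map(t => Future { df.write.jdbc(url, t, prop) })
Await.ready(Future.sequence(writes), 30.minutes)
df.unpersist()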

Developer