
I'm using Spark 1.6 and running into the issue above when I run the following code:

// Imports
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import scala.concurrent.ExecutionContext.Implicits.global
import java.util.Properties
import scala.concurrent.Future

// Set up spark on local with 2 threads
val conf = new SparkConf().setMaster("local[2]").setAppName("app")
val sc = new SparkContext(conf)
val sqlCtx = new HiveContext(sc)

// Create fake dataframe
import sqlCtx.implicits._
var df = sc.parallelize(1 to 50000).map { i => (i, i, i, i, i, i, i) }.toDF("a", "b", "c", "d", "e", "f", "g").repartition(2)
// Write it as a parquet file
df.write.parquet("/tmp/parquet1")
df = sqlCtx.read.parquet("/tmp/parquet1")

// JDBC connection
val url = s"jdbc:postgresql://localhost:5432/tempdb"
val prop = new Properties()
prop.setProperty("user", "admin")
prop.setProperty("password", "")

// 4 futures - at least one of them has been consistently failing
val x1 = Future { df.write.jdbc(url, "temp1", prop) }
val x2 = Future { df.write.jdbc(url, "temp2", prop) }
val x3 = Future { df.write.jdbc(url, "temp3", prop) }
val x4 = Future { df.write.jdbc(url, "temp4", prop) }
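
A quick way to surface the exception in the spark-shell is to block on the futures (a sketch, using the x1..x4 values above):

import scala.concurrent.Await
import scala.concurrent.duration._

// Wait for all four writes to finish, then print each outcome;
// the failing write shows up as a Failure wrapping the exception below.
Seq(x1, x2, x3, x4).foreach(f => Await.ready(f, 10.minutes))
Seq(x1, x2, x3, x4).foreach(f => println(f.value))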

Here's the github gist: https://gist.github.com/karanveerm/27d852bf311e39f05491

The error I get is:

        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) ~[org.apache.spark.spark-sql_2.11-1.6.0.jar:1.6.0]
        at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2125) ~[org.apache.spark.spark-sql_2.11-1.6.0.jar:1.6.0]
        at org.apache.spark.sql.DataFrame.foreachPartition(DataFrame.scala:1482) ~[org.apache.spark.spark-sql_2.11-1.6.0.jar:1.6.0]
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.saveTable(JdbcUtils.scala:247) ~[org.apache.spark.spark-sql_2.11-1.6.0.jar:1.6.0]
        at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:306) ~[org.apache.spark.spark-sql_2.11-1.6.0.jar:1.6.0]
        at writer.SQLWriter$.writeDf(Writer.scala:75) ~[temple.temple-1.0-sans-externalized.jar:na]
        at writer.Writer$.writeDf(Writer.scala:33) ~[temple.temple-1.0-sans-externalized.jar:na]
        at controllers.Api$$anonfun$downloadTable$1$$anonfun$apply$25.apply(Api.scala:460) ~[temple.temple-1.0-sans-externalized.jar:2.4.6]
        at controllers.Api$$anonfun$downloadTable$1$$anonfun$apply$25.apply(Api.scala:452) ~[temple.temple-1.0-sans-externalized.jar:2.4.6]
        at scala.util.Success$$anonfun$map$1.apply(Try.scala:237) ~[org.scala-lang.scala-library-2.11.7.jar:na]

Is this a Spark bug, or am I doing something wrong? Are there any workarounds?

sparknoob

3 Answers


After trying several things, I found that one of the threads created by the global ForkJoinPool gets its spark.sql.execution.id property set to a random value. I could not identify the process that actually set it, but I could work around it by using my own ExecutionContext.

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// A fixed pool creates its threads up front, before any SQL execution is in flight.
val executorService = Executors.newFixedThreadPool(4)
implicit val ec = ExecutionContext.fromExecutorService(executorService)
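
With this implicit in scope (instead of ExecutionContext.Implicits.global), the writes run on threads the fixed pool created up front. A sketch, assuming the df, url and prop from the question:

import scala.concurrent.Future

// Scheduled on ec (the fixed pool above), not the global ForkJoinPool,
// so no worker thread inherits a stale spark.sql.execution.id.
val w1 = Future { df.write.jdbc(url, "temp1", prop) }
val w2 = Future { df.write.jdbc(url, "temp2", prop) }
val w3 = Future { df.write.jdbc(url, "temp3", prop) }
val w4 = Future { df.write.jdbc(url, "temp4", prop) }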

I used code from http://danielwestheide.com/blog/2013/01/16/the-neophytes-guide-to-scala-part-9-promises-and-futures-in-practice.html. My guess is that the ForkJoinPool clones thread attributes when it creates new threads, so if that happens while a SQL execution is in progress the new thread inherits the non-null execution id, whereas a FixedThreadPool creates its threads at instantiation.
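
If you want to verify this, the execution id is stored as a thread-local property on the SparkContext, so you can probe what a pool thread has inherited. A small diagnostic sketch (assumes the sc from the question and the global ExecutionContext import):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Prints the execution id (or null) inherited by the pool thread;
// "spark.sql.execution.id" is the key used by SQLExecution in Spark 1.6.
Future {
  println(Thread.currentThread.getName + ": " +
    sc.getLocalProperty("spark.sql.execution.id"))
}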

Knshiro
  • I've run into the same issue. But this solution doesn't seem to help. I still see the `spark.sql.execution.id already set` error. – mottosan Mar 17 '16 at 14:26
  • 1
    @smas the problem is not in the number of threads but in the initialization of those threads. The fork join pool will initialize threads at random times and to initialize new threads it clones all attributes. So if at the time of initialization of a new thread the existing thread has a SQL execution id set it will copy it to the new one instead of letting a new one be generated. – Knshiro Oct 22 '17 at 00:43

Please check SPARK-13747.

Consider using Spark 2.2.0 or higher if that is applicable in your environment.
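
For reference, a sketch of the same four parallel writes on the Spark 2.x SparkSession API (table names and JDBC settings taken from the question):

import java.util.Properties
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("app").getOrCreate()
import spark.implicits._

val df = spark.sparkContext.parallelize(1 to 50000)
  .map(i => (i, i, i, i, i, i, i))
  .toDF("a", "b", "c", "d", "e", "f", "g")
  .repartition(2)

val url = "jdbc:postgresql://localhost:5432/tempdb"
val prop = new Properties()
prop.setProperty("user", "admin")
prop.setProperty("password", "")

// On 2.2.0+ (where SPARK-13747 is resolved) these concurrent writes
// no longer trip over an inherited execution id.
val writes = (1 to 4).map(i => Future { df.write.jdbc(url, s"temp$i", prop) })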

morsik

Test 1: Does it help if you run each of the df.write operations serially instead of in parallel futures?

Test 2: Does it help if you persist the DataFrame, run all the df.write operations in parallel, and unpersist only after they have all completed?
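
A sketch of both tests, assuming the df, url and prop from the question:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Test 1: serial writes - no concurrency, so no clashing execution ids.
Seq("temp1", "temp2", "temp3", "temp4").foreach(t => df.write.jdbc(url, t, prop))

// Test 2: persist once, write in parallel, unpersist after all writes complete.
df.persist()
val writes = Seq("temp1", "temp2", "temp3", "temp4")
  .map(t => Future { df.write.jdbc(url, t, prop) })
Await.ready(Future.sequence(writes), 30.minutes)
df.unpersist()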

Developer