
I am trying to convert my Spark Scala scripts (written in spark-shell) into Scala classes, objects, methods (def), etc., so I can build JARs for spark-submit. I make a lot of Spark SQL calls that perform timestamp computations sensitive to the timezone. Because every distributed node may have a different default timezone configured, I have to set the following configuration explicitly to make sure the timezone is always UTC for any subsequent Spark SQL timestamp manipulation by any Spark SQL function call within that method.

spark.conf.set("spark.sql.session.timeZone", "UTC")

Should that method's signature include (spark: org.apache.spark.sql.SparkSession) as a parameter, so that I can always begin with an explicit statement setting the SparkSession's timezone to UTC, instead of taking any chances that the distributed Spark nodes may not all share the exact same timezone configuration?
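
Something like this sketch, to make the idea concrete (the method name and body are mine, purely for illustration):

def normalizeToUtc(spark: org.apache.spark.sql.SparkSession,
                   inputDF: org.apache.spark.sql.DataFrame): org.apache.spark.sql.DataFrame = {
  spark.conf.set("spark.sql.session.timeZone", "UTC") // force UTC before any timestamp work
  // ...Spark SQL timestamp computations would follow here...
  inputDF
}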

My next concern is: how do I find out whether the spark variable set up by spark-shell is a val or a var? Searching for an answer, I found the code snippet below, hoping it would tell me whether this Scala variable is immutable or mutable, but it did not tell me whether spark is a var or a val. Do I need to return spark to the method's caller after I set spark.sql.session.timeZone to UTC, since I modified it inside my method? Currently my method signature takes two input parameters (org.apache.spark.sql.SparkSession, org.apache.spark.sql.DataFrame) and returns a tuple (org.apache.spark.sql.SparkSession, org.apache.spark.sql.DataFrame).

scala> def manOf[T: Manifest](t: T): Manifest[T] = manifest[T]
manOf: [T](t: T)(implicit evidence$1: Manifest[T])Manifest[T]

scala> manOf(List(1))
res3: Manifest[List[Int]] = scala.collection.immutable.List[Int]

scala> manOf(spark)
res2: Manifest[org.apache.spark.sql.SparkSession] = org.apache.spark.sql.SparkSession

Extra context: As part of launching spark-shell, the variable spark is initialized as follows:

Spark context available as 'sc' (master = yarn, app id = application_1234567890_111111).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_REDACTED)
Type in expressions to have them evaluated.
Type :help for more information.
  • It is a val, in any case, if you are not going to mutate it _(which wouldn't make sense)_ it doesn't matter what it is on the shell, just make it a `val`. – Luis Miguel Mejía Suárez Feb 24 '20 at 21:34
  • The fact that I am setting the timezone to UTC via `spark.conf.set("spark.sql.session.timeZone", "UTC")`, which is a necessary pre-step for my Spark SQL timezone computations to be correct; isn't this considered a mutation? – geekyj Feb 24 '20 at 22:41
  • Yeah that is a mutation, but you can have a mutable object on a `val` - if you do not understand the difference between mutable object vs mutable reference, I would encourage you to study more about **Scala** before proceeding with **Spark**. – Luis Miguel Mejía Suárez Feb 24 '20 at 22:43
  • Thank you for the hint that `spark` is a `val` and that it is a mutable object! – geekyj Feb 25 '20 at 01:58
  • A good reference on mutable objects: http://otfried.org/courses/cs109scala/tutorial_mutable.html – geekyj Feb 25 '20 at 02:40
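
To make that mutable object vs. mutable reference distinction concrete, here is a minimal Scala sketch (not Spark-specific; the names are illustrative):

import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer(1, 2, 3) // a val holding a *mutable* object
buf += 4                       // OK: mutates the object the val refers to
// buf = ArrayBuffer(9)        // won't compile: a val cannot be reassigned

var n = 1                      // a var is a mutable *reference*
n = 2                          // OK: a var can be rebound

spark is like buf: the binding is a val, but the object it refers to carries mutable configuration state. Note that a Manifest only reports the static type, so it can never answer the val-vs-var question.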

1 Answer


Thanks to @Luis Miguel Mejía Suárez for providing the answer and hints in the comments on my question. I ran the following experiment to confirm that spark is a mutable object: spark inside the method is simply a reference to the same object as spark outside the method, so a mutation made inside the method is visible to the caller. Although this side effect is not a purely functional implementation, it saves me the trouble of returning the spark object to the caller for subsequent processing. If someone else has a better solution, please do share.

import org.apache.spark.sql.{DataFrame, SparkSession}

def x(spark: SparkSession, inputDF: DataFrame): DataFrame = {
  import spark.implicits._
  // Mutation of the session object inside the method; the caller's `spark`
  // is the same object, so the caller observes this change too.
  spark.conf.set("spark.sql.session.timeZone", "UTC")

  // ...timestamp computations using org.apache.spark.sql.functions...
  val finalDF = inputDF // placeholder: stands in for the real transformations
  finalDF
}

Launched spark-shell and executed the following:

Spark context available as 'sc' (master = yarn, app id = application_1234567890_222222).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_REDACTED)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.conf.get("spark.sql.session.timeZone")
res1: String = America/New_York

scala> :load x.scala
x: (spark: org.apache.spark.sql.SparkSession, inputDF: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame

scala> val timeConvertedDF = x(spark, inputDF)
timeConvertedDF: org.apache.spark.sql.DataFrame = [att1: timestamp, att2: string ... 25 more fields]

scala> spark.conf.get("spark.sql.session.timeZone")
res4: String = UTC
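
As an aside, if you would rather avoid mutating a shared session inside the method at all, the same setting can be fixed once when the session is constructed. A minimal sketch, assuming a standalone application (the app name is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("utc-app")
  .config("spark.sql.session.timeZone", "UTC") // session timezone fixed up front
  .getOrCreate()

The equivalent at launch time is passing --conf spark.sql.session.timeZone=UTC to spark-submit, which avoids touching the code at all.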