
Consider the following code, and hence a question on performance - imagine it, of course, at scale:

import org.apache.spark.sql.functions.{col, when}

val df = sc.parallelize(Seq(
   ("r1", 1, 1),
   ("r2", 6, 4),
   ("r3", 4, 1),
   ("r4", 1, 2)
   )).toDF("ID", "a", "b")

val ones = df.schema.map(c => c.name).drop(1).map(x => when(col(x) === 1, 1).otherwise(0)).reduce(_ + _)

// or

def ones = df.schema.map(c => c.name).drop(1).map(x => when(col(x) === 1, 1).otherwise(0)).reduce(_ + _)

df.withColumn("ones", ones).explain

Below are the two Physical Plans, for def and for val - which are the same:

 == Physical Plan == **def**
 *(1) Project [_1#760 AS ID#764, _2#761 AS a#765, _3#762 AS b#766, (CASE WHEN (_2#761 = 1) THEN 1 ELSE 0 END + CASE WHEN (_3#762 = 1) THEN 1 ELSE 0 END) AS ones#770]
 +- *(1) SerializeFromObject [staticinvoke(class 
 org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._1, true, false) AS _1#760, assertnotnull(input[0, scala.Tuple3, true])._2 AS _2#761, assertnotnull(input[0, scala.Tuple3, true])._3 AS _3#762]
   +- Scan[obj#759]


 == Physical Plan == **val**
 *(1) Project [_1#780 AS ID#784, _2#781 AS a#785, _3#782 AS b#786, (CASE WHEN (_2#781 = 1) THEN 1 ELSE 0 END + CASE WHEN (_3#782 = 1) THEN 1 ELSE 0 END) AS ones#790]
 +- *(1) SerializeFromObject [staticinvoke(class 
 org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._1, true, false) AS _1#780, assertnotnull(input[0, scala.Tuple3, true])._2 AS _2#781, assertnotnull(input[0, scala.Tuple3, true])._3 AS _3#782]
    +- Scan[obj#779] 

So, there is a discussion to be had on:

val vs def performance.

Then:

  • I see no difference in the .explains. OK.

  • From elsewhere: val evaluates when defined, def - when called.

  • I am assuming that it makes no difference whether a val or def is used here, as it is essentially within a loop and there is a reduce. Is this correct?
  • Will df.schema.map(c => c.name).drop(1) be executed per dataframe row? There is of course no need. Does Catalyst optimize this?
  • If the above is true, in that the statement is executed every time to work out the columns to process, how can we make that piece of code occur just once? Should we make a val of it, i.e. val ones = df.schema.map(c => c.name).drop(1)?
  • val vs def is more than a Scala question; there is also a Spark component.

To the downvoter: I ask this because, while the following example is very clear, the val ones has more to it than the code below, and the code below is not iterated:

var x = 2 // using var as I need to change it to 3 later
val sq = x*x // evaluates right now
x = 3 // no effect! sq is already evaluated
println(sq)
thebluephantom
2 Answers


There are two core concepts at hand here: Spark DAG creation and evaluation, and Scala's val vs def definitions. These are orthogonal.

I see no difference in the .explains

You see no difference because from Spark's perspective, the query is the same. It doesn't matter to the analyser if you store the graph in a val or create it each time with a def.

From elsewhere: val evaluates when defined, def - when called.

This is Scala semantics. A val is an immutable reference which gets evaluated once at the declaration site. A def stands for method definition, and if you allocate a new DataFrame inside it, it will create one each time you call it. For example:

def ones = 
  df
   .schema
   .map(c => c.name)
   .drop(1)
   .map(x => when(col(x) === 1, 1).otherwise(0))
   .reduce(_ + _)

val firstcall = ones
val secondCall = ones

The code above will build two separate DAGs over the DF.
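As a plain-Scala sketch of that difference (no Spark needed; the `builds` counter is a hypothetical stand-in for DAG construction):

```scala
var builds = 0  // counts how many times the expression body runs

def asDef = { builds += 1; "expr" }  // body runs on every call
val asVal = { builds += 1; "expr" }  // body ran once, here, at definition

val d1 = asDef  // re-evaluates: builds is now 2
val d2 = asDef  // re-evaluates: builds is now 3
val v1 = asVal  // just reads the stored value
val v2 = asVal  // just reads the stored value

println(builds)  // 3: once for the val, twice for the def
```

The same counting applies to `ones`: as a def, the Column expression is rebuilt on every use; as a val, it is built once and reused.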

I am assuming that it makes no difference whether a val or def is used here as it essentially within a loop and there is a reduce. Is this correct?

I'm not sure which loop you're talking about, but see my answer above for the distinction between the two.

Will df.schema.map(c => c.name).drop(1) be executed per dataframe row? There is of course no need. Does Catalyst optimize this?

No. Note that drop(1) here is Scala's Seq.drop acting on the list of column names - it removes the first name, ID - not a DataFrame operation. The list and the Column expression built from it are computed once on the driver; only the resulting expression is evaluated per row by Spark.
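As a minimal plain-Scala sketch of what drop(1) in the question's expression actually touches (the column names hard-coded here, taken from the example DataFrame):

```scala
// What df.schema.map(c => c.name) yields for the example DataFrame
val names = Seq("ID", "a", "b")

// drop(1) is Seq.drop: it removes the first *column name*, not a row
val cols = names.drop(1)

println(cols)  // List(a, b)
```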

If the above is true in that the statement is executed every time for the columns to process, how can we make that piece of code occur just once? Should we make a val of val ones = df.schema.map(c => c.name).drop(1)

It does occur only once per data frame (of which, in your example, there is exactly one).
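If you want to make that single evaluation explicit, you can hoist the name list and the built expression into their own vals - a sketch, assuming the same df as in the question:

```scala
import org.apache.spark.sql.functions.{col, when}

// Computed once, on the driver: every column name except the first ("ID")
val cols = df.schema.map(_.name).drop(1)

// Also computed once: a single Column expression built from those names
val ones = cols.map(c => when(col(c) === 1, 1).otherwise(0)).reduce(_ + _)

df.withColumn("ones", ones)
```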

Yuval Itzchakov
  • THX. But confused. 1) I do not get the val firstcall, secondCall; I do not call it like that. 2) drop(1) for the entire DF - how can we prove that? Makes sense for sure. 3) By loop I mean processing the DF. 4) The val will execute faster, I assume - or does it not matter? – thebluephantom Feb 25 '19 at 07:08
  • Just to be really clear, this example is easy to follow: var x = 2 // using var as I need to change it to 3 later; val sq = x*x // evaluates right now; x = 3 // no effect! sq is already evaluated; println(sq) - and I was trying to extrapolate exactly that logic to what I presented. – thebluephantom Feb 25 '19 at 09:11
  • @thebluephantom It was just an example, to show the properties of `val` vs `def`. You can prove the behavior by creating a dummy DF and then running `drop(1).count`. Not sure what your last sentence means. – Yuval Itzchakov Feb 25 '19 at 09:30
  • I am sure you are right. The last sentence is that very simple example of val vs def is clear to follow. What I had was a little less obvious - imho – thebluephantom Feb 25 '19 at 09:36
  • If the answer gets +4 then surely an upvote on the question is valid! – thebluephantom Feb 25 '19 at 11:08
  • Is there a recommendation on best practice here when using this to define dataframes that will be operated on a lot? Something like "generally use val, some cases where you'd want to use def might look like x" – Brendan May 11 '20 at 23:13
  • @Brendan If your definition of a DataFrame is constant and you want to reuse the same DataFrame, use a `val`. If you want to generate a *new* DataFrame per call, use a `def`. – Yuval Itzchakov May 12 '20 at 08:25

The ones expression won't get evaluated per DataFrame row; it will be evaluated once. A def gets evaluated per call. For example, if three DataFrames use that ones expression, then it will be evaluated three times. The difference with a val is that the expression would be evaluated only once.

Basically, the ones expression creates an instance of org.apache.spark.sql.Column, namely (CASE WHEN (a = 1) THEN 1 ELSE 0 END + CASE WHEN (b = 1) THEN 1 ELSE 0 END). If the expression is a def, then a new org.apache.spark.sql.Column is instantiated every time it is called. If it is a val, then the same instance is used over and over.
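A plain-Scala sketch of that instance behaviour, using a hypothetical Expr class as a stand-in for org.apache.spark.sql.Column:

```scala
// Stand-in for org.apache.spark.sql.Column
class Expr(val sql: String)

def exprDef = new Expr("CASE WHEN ...")  // a fresh instance per call
val exprVal = new Expr("CASE WHEN ...")  // one instance, created here

// def: two uses yield two distinct objects
println(exprDef eq exprDef)  // false

// val: two uses yield the very same object
println(exprVal eq exprVal)  // true
```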

uh_big_mike_boi
  • It meaning? Not clear what you mean. There is only one dataframe. – thebluephantom Feb 24 '19 at 23:40
  • But it has a reduce? It is inside a loop. – thebluephantom Feb 24 '19 at 23:51
  • I may be missing something, but ... var x = 2 // using var as I need to change it to 3 later; val sq = x*x // evaluates right now; x = 3 // no effect! sq is already evaluated; println(sq) - I get this, but this is different – thebluephantom Feb 24 '19 at 23:58
  • This example I posted fine, easy to follow – thebluephantom Feb 25 '19 at 00:00
  • I think of it in two ways. One is Spark and one is Scala. Spark will evaluate the SQL expression every time. But `def` won't be evaluated from a Scala perspective every row. There is one call, and it returns `org.apache.spark.sql.Column = (CASE WHEN (a = 1) THEN 1 ELSE 0 END + CASE WHEN (b = 1) THEN 1 ELSE 0 END)`. Spark will use this Spark SQL expression to get a value for every row, but Scala will only evaluate to obtain this Spark SQL expression once. If there is a `df2.withColumn("ones", ones)`, then it will be evaluated once more if it is a `def`, but not evaluated again for a `val`. – uh_big_mike_boi Feb 25 '19 at 00:08
  • What about the columns list? – thebluephantom Feb 25 '19 at 00:12
  • I find the Phys Pl hard to follow at times. – thebluephantom Feb 25 '19 at 00:14
  • https://stackoverflow.com/questions/18887264/what-is-the-difference-between-def-and-val-to-define-a-function – thebluephantom Feb 25 '19 at 00:15
  • Having doubts based on above. – thebluephantom Feb 25 '19 at 00:15
  • Yeah in the example you linked, that can apply to your example too. Basically, a new instance of `org.apache.spark.sql.Column` is instantiated every time a call uses something defined by a `def`. If it is `val`, it just uses the same one over and over. – uh_big_mike_boi Feb 25 '19 at 00:29
  • I am not totally convinced. – thebluephantom Feb 25 '19 at 01:22