
The sparklyr tutorial I'm following says I can use compute() to store the results of the preceding dplyr statements in a new Spark data frame.

The code in 'code 1' creates a new Spark data frame called "NewSparkDataframe" and returns a spark_tbl, which I assign to "NewTbl". I can see the new Spark data frame listed by src_tbls(). This is all as expected.

If I instead run 'code 2' without compute(), it still produces a spark_tbl, which I again assign to "NewTbl". This time, though, no new Spark data frame appears in src_tbls().

I'm wondering how "NewTbl" can still be run in code 2 if there's apparently no "NewSparkDataframe" in Spark.

Also, what is the point of using compute() if I can still access the same newly created spark_tbl through "NewTbl"?

code 1:

NewTbl <- mySparkTbl %>%
        some_dplyr_statements %>%    # placeholder for the actual dplyr verbs
        compute("NewSparkDataframe")
src_tbls(spark_conn)
## [1] "NewSparkDataframe"

code 2:

NewTbl <- mySparkTbl %>%
        some_dplyr_statements        # placeholder for the actual dplyr verbs
src_tbls(spark_conn)

1 Answer


Spark uses lazy evaluation, which means that in the second case you're not actually creating or storing a table. NewTbl only holds the query that describes the result, and that query is re-run against mySparkTbl every time you access NewTbl.
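You can see this by asking the lazy tbl for the SQL it will run. A minimal sketch, assuming a hypothetical local Spark connection sc and a copied-in mtcars table (the names sc, mtcars_tbl, and "mtcars_spark" are placeholders, not from your code):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")              # hypothetical local connection
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark")  # example source table

# Without compute(): NewTbl is only a stored query, not a table in Spark
NewTbl <- mtcars_tbl %>%
  filter(cyl == 6) %>%
  select(mpg, cyl)

show_query(NewTbl)   # prints the SQL that re-runs on every access
src_tbls(sc)         # lists only "mtcars_spark"; no new table was created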

What compute() does is force the evaluation up front and store the result as a temporary table in your Spark session, which is why it shows up in src_tbls(). This is useful when you reuse a table often, because Spark can read the stored result instead of recomputing the whole pipeline each time.
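Continuing the same hypothetical session, a sketch of the compute() case:

# With compute(): the pipeline is evaluated once and the result is
# cached in Spark as a temporary table under the given name
NewTbl <- mtcars_tbl %>%
  filter(cyl == 6) %>%
  select(mpg, cyl) %>%
  compute("NewSparkDataframe")

src_tbls(sc)   # now also lists "NewSparkDataframe"

In both cases collect(NewTbl) brings the result into R; the difference is that after compute(), Spark reads the stored table rather than re-running the whole pipeline.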
