
If I connect to a Spark cluster, copy some data to it, and disconnect, ...

library(dplyr)
library(sparklyr)
sc <- spark_connect("local")
copy_to(sc, iris)
src_tbls(sc)
## [1] "iris"
spark_disconnect(sc)

then the next time I connect to Spark, the data is not there.

sc <- spark_connect("local")
src_tbls(sc)
## character(0)
spark_disconnect(sc)

This is different from working with a database, where the data is there regardless of how many times you connect.

How do I persist data in the Spark cluster between connections?

I thought sdf_persist() might be what I want, but it appears not.
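
For reference, my sdf_persist() attempt looked roughly like this (just a sketch; the exact arguments may have differed):

library(sparklyr)
sc <- spark_connect("local")
iris_tbl <- copy_to(sc, iris)
# persist the Spark DataFrame to disk; this only lasts for the current session
sdf_persist(iris_tbl, storage.level = "DISK_ONLY")
spark_disconnect(sc)
# reconnecting still gives an empty session, as above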

Richie Cotton
  • It's because data doesn't persist over different Spark sessions, which is what happens if you disconnect and then reconnect again. – mtoto Feb 23 '17 at 13:50
  • @mtoto Thanks. So there is no way to keep a session alive when you disconnect? – Richie Cotton Feb 23 '17 at 14:04
  • Can you try with `sdf_persist(storage.level = "DISK_ONLY")`? I'm not sure that it will work, though. I have never tried that with Spark, to be honest. – eliasah Feb 23 '17 at 14:06
  • @RichieCotton Probably only an issue in `"local"` mode. But to connect to a remote cluster, you'll need RStudio Server installed on the cluster as well. – mtoto Feb 23 '17 at 14:24
  • @eliasah Sorry, `sdf_persist(storage.level = "DISK_ONLY")` doesn't work; it still connects to an empty session. – Richie Cotton Feb 23 '17 at 17:11
  • @RichieCotton Did you learn something new about this problem? – Alex Apr 30 '17 at 20:35
  • @Alex There is no permanence between clusters. People seem to just keep clusters running indefinitely, or save/reload their datasets using `spark_write_parquet()` and `spark_read_parquet()` (much faster than `copy_to()`). – Richie Cotton May 02 '17 at 19:49

1 Answer


Spark is technically an engine that runs on a computer or cluster to execute tasks. It is not a database or a file system. You can save your data to a file system when you are done and load it back up during your next session.

https://en.wikipedia.org/wiki/Apache_Spark
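
For example, here is a minimal sketch of that save-and-reload pattern with sparklyr's parquet functions (the path is just a placeholder):

library(sparklyr)

# first session: copy the data in, then write it out to persistent storage
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris)
spark_write_parquet(iris_tbl, path = "/tmp/iris_parquet", mode = "overwrite")
spark_disconnect(sc)

# later session: read the saved files back into the new Spark session
sc <- spark_connect(master = "local")
iris_tbl <- spark_read_parquet(sc, name = "iris", path = "/tmp/iris_parquet")
src_tbls(sc)
## [1] "iris"
spark_disconnect(sc)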

Andrew Troiano
  • Yeah, this seems about right. But is there a workaround for this? Some way to more tightly integrate Spark with a database or filesystem so that the data, once loaded, is always available every time you fire up Spark? Of course you can always load the data up during the next session, but at least in my experience, copying the data to Spark is time consuming. – Hernando Casas Apr 28 '17 at 17:55
  • Good question, I haven't seen anything like that. What I typically do is save my datasets in iterations as parquet files and load them as needed. So if you have a large set of data that takes a long time to run, load it, do an initial set of work, save that work, and when you start later, load in that intermediate file (see the sketch below). – Andrew Troiano May 01 '17 at 16:17
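
To make that workflow concrete, here is a rough sketch of the checkpoint-and-reload pattern (the paths and table names are made up):

library(dplyr)
library(sparklyr)

# first session: do the expensive initial work once, then checkpoint the result
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris)
species_counts <- iris_tbl %>%
  group_by(Species) %>%
  summarise(n = n())
spark_write_parquet(species_counts, path = "/tmp/species_counts", mode = "overwrite")
spark_disconnect(sc)

# later session: skip the initial work and load the intermediate result directly
sc <- spark_connect(master = "local")
species_counts <- spark_read_parquet(sc, name = "species_counts", path = "/tmp/species_counts")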