
If I connect to a Spark cluster, copy some data to it, and disconnect, ...

library(dplyr)
library(sparklyr)
sc <- spark_connect("local")
copy_to(sc, iris)
src_tbls(sc)
## [1] "iris"
spark_disconnect(sc)

then the next time I connect to Spark, the data is not there.

sc <- spark_connect("local")
src_tbls(sc)
## character(0)
spark_disconnect(sc)

This is different from working with a database, where the data is there regardless of how many times you connect.

How do I persist data in the Spark cluster between connections?

I thought sdf_persist() might be what I want, but it appears not.
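
For reference, my sdf_persist() attempt looked roughly like this (just a sketch; the exact arguments may have differed):

library(sparklyr)
sc <- spark_connect("local")
iris_tbl <- copy_to(sc, iris)
# persist the Spark DataFrame to disk; this only lasts for the current session
sdf_persist(iris_tbl, storage.level = "DISK_ONLY")
spark_disconnect(sc)
# reconnecting still gives an empty session, as above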

Richie Cotton
  • It's because data doesn't persist over different Spark sessions, which is what happens if you disconnect and then reconnect again. – mtoto Feb 23 '17 at 13:50
  • @mtoto Thanks. So there is no way to keep a session alive when you disconnect? – Richie Cotton Feb 23 '17 at 14:04
  • Can you try with `sdf_persist(storage.level = "DISK_ONLY")`? I'm not sure that it will work, though. I have never tried that with Spark, to be honest. – eliasah Feb 23 '17 at 14:06
  • @RichieCotton Probably only an issue in `"local"` mode. But to connect to a remote cluster, you'll need RStudio Server installed on the cluster as well. – mtoto Feb 23 '17 at 14:24
  • @eliasah Sorry, `sdf_persist(storage.level = "DISK_ONLY")` doesn't work; it still connects to an empty session. – Richie Cotton Feb 23 '17 at 17:11
  • @RichieCotton Did you learn something new about this problem? – Alex Apr 30 '17 at 20:35
  • @Alex There is no permanence between clusters. People seem to just keep clusters running indefinitely, or save/reload their datasets using `spark_write_parquet()` and `spark_read_parquet()` (much faster than `copy_to()`). – Richie Cotton May 02 '17 at 19:49

1 Answer


Spark is technically an engine that runs on a computer or cluster to execute tasks. It is not a database or a file system. You can save your data to a file system when you are done and load it back up during your next session.

https://en.wikipedia.org/wiki/Apache_Spark
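
For example, here is a minimal sketch of that save-and-reload pattern with sparklyr's parquet functions (the path is just a placeholder):

library(sparklyr)

# first session: copy the data in, then write it out to persistent storage
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris)
spark_write_parquet(iris_tbl, path = "/tmp/iris_parquet", mode = "overwrite")
spark_disconnect(sc)

# later session: read the saved files back into the new Spark session
sc <- spark_connect(master = "local")
iris_tbl <- spark_read_parquet(sc, name = "iris", path = "/tmp/iris_parquet")
src_tbls(sc)
## [1] "iris"
spark_disconnect(sc)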

Andrew Troiano
  • Yeah, this seems about right. But is there a workaround for this? Some way to more tightly integrate Spark with a database or filesystem so that the data, once loaded, is always available every time you fire up Spark? Of course you can always load the data up during the next session, but at least in my experience, copying the data to Spark is time consuming. – Hernando Casas Apr 28 '17 at 17:55
  • Good question, I haven't seen anything like that. What I typically do is save my datasets in iterations as parquet files and load them as needed. So if you have a large set of data that takes a long time to run, load it, do an initial set of work, save that work, and when you start later, load in that intermediate file (see the sketch below). – Andrew Troiano May 01 '17 at 16:17
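
To make that workflow concrete, here is a rough sketch of the checkpoint-and-reload pattern (the paths and table names are made up):

library(dplyr)
library(sparklyr)

# first session: do the expensive initial work once, then checkpoint the result
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris)
species_counts <- iris_tbl %>%
  group_by(Species) %>%
  summarise(n = n())
spark_write_parquet(species_counts, path = "/tmp/species_counts", mode = "overwrite")
spark_disconnect(sc)

# later session: skip the initial work and load the intermediate result directly
sc <- spark_connect(master = "local")
species_counts <- spark_read_parquet(sc, name = "species_counts", path = "/tmp/species_counts")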