I can connect to MongoDB from SparkR (I am using RStudio, Spark 2.x.x, and Mongo connector v2.0) as described at https://docs.mongodb.com/spark-connector/current/r-api/. I would like to do the same using sparklyr. Is that possible? I could not find any examples of it.
3 Answers
I'm also trying to load from Mongo using sparklyr. I haven't found a solution yet, but this is what I have tried so far (my database is "nasa" and the collection is "eva"):
library(sparklyr)

# Point SPARK_HOME at the local Spark installation and configure the connector
spark_home <- "/home/my_user_name/Workspaces/Mongo/spark-2.0.1-bin-hadoop2.7/"
Sys.setenv(SPARK_HOME = spark_home)
config <- sparklyr::spark_config()
config$sparklyr.defaultPackages <- c("org.mongodb.spark:mongo-spark-connector_2.10:1.1.0")
config$spark.mongodb.input.uri <- "mongodb://localhost:27017/nasa.eva"
config$spark.mongodb.output.uri <- "mongodb://localhost:27017/nasa.eva"
Spark.connection <- sparklyr::spark_connect(master = "local", version = "2.0.1", config = config)

# Get the underlying SparkSession; the Cassandra setting here (and the
# "keyspace"/"table" options below) look like leftovers from a Cassandra example
Spark.session <- sparklyr::invoke_static(Spark.connection, "org.apache.spark.sql.SparkSession", "builder") %>%
  sparklyr::invoke("config", "spark.cassandra.connection.host", "localhost") %>%
  sparklyr::invoke("getOrCreate")

# Read the collection via the connector's DataFrame source and register it as a table
uri <- "mongodb://localhost/nasa.eva"
load <- invoke(Spark.session, "read") %>%
  invoke("format", "com.mongodb.spark.sql.DefaultSource") %>%
  invoke("option", "spark.mongodb.input.uri", uri) %>%
  invoke("option", "keyspace", "nasa") %>% invoke("option", "table", "eva") %>%
  invoke("load")
tbl <- sparklyr:::spark_partition_register_df(Spark.connection, load, "mongo_tbl", 0, TRUE)
It does not work yet, but maybe it can give you some ideas. I hope it helps.
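One observation, for what it's worth: since spark.mongodb.input.uri is already set in the connection config, the Mongo connector should be able to pick it up from the Spark conf, so the per-read options (including the Cassandra-style "keyspace"/"table" ones) may be unnecessary. A minimal, untested sketch under that assumption:

# Untested sketch: relies on the connector reading spark.mongodb.input.uri
# from the Spark conf set above, so no per-read options are passed
load <- invoke(Spark.session, "read") %>%
  invoke("format", "com.mongodb.spark.sql.DefaultSource") %>%
  invoke("load")
tbl <- sparklyr:::spark_partition_register_df(Spark.connection, load, "mongo_tbl", 0, TRUE)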

Finally, it does seem to be possible, and there is an easier way: install the development version of sparklyr:
devtools::install_github("rstudio/sparklyr")
followed by:
config <- spark_config()
config$sparklyr.defaultPackages <- c("org.mongodb.spark:mongo-spark-connector_2.10:1.1.0")
sc <- spark_connect(master = "local", config = config)

uri <- "mongodb://localhost/nasa.eva"
spark_read_source(
  sc,
  name = "spark-table-name",                       # name the table will have in Spark
  source = "com.mongodb.spark.sql.DefaultSource",  # the Mongo connector's DataFrame source
  options = list(
    spark.mongodb.input.uri = uri,
    keyspace = "nasa",
    table = "eva"),
  memory = FALSE)
"nasa" and "eva" are the mongo database and the mongo collection, respectively. You can find more information here, at the sparklyr github forum. I hope this helps!

Does anyone have an update on this issue?
I could connect MongoDB > Spark > PySpark successfully on my local machine, but I cannot find any material or a working solution for connecting MongoDB > Spark > RStudio with sparklyr.
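For anyone trying this with the versions mentioned in the question (Spark 2.x, Mongo connector v2.0), a minimal sketch following the same pattern as the answer above; the _2.11:2.0.0 connector coordinates and the table name "mongo_eva" are assumptions on my part, not something I have verified:

library(sparklyr)

config <- spark_config()
# Assumed coordinates: the connector build for Scala 2.11, matching Spark 2.x
config$sparklyr.defaultPackages <- c("org.mongodb.spark:mongo-spark-connector_2.11:2.0.0")
sc <- spark_connect(master = "local", config = config)

# Same DataFrame source and options as in the earlier answer
spark_read_source(
  sc,
  name = "mongo_eva",
  source = "com.mongodb.spark.sql.DefaultSource",
  options = list(spark.mongodb.input.uri = "mongodb://localhost:27017/nasa.eva"),
  memory = FALSE)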
