I have an R script similar to the example below, in which some data is loaded from HDFS and then stored in some form, in this case as a Parquet file.
library(SparkR)
# Initialize SparkContext and SQLContext
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
# Create a simple local data.frame (not used below; carried over from the SparkR example)
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
# Create a DataFrame from a JSON file
peopleDF <- jsonFile(sqlContext, file.path("/people.json"))
# Register this DataFrame as a table.
registerTempTable(peopleDF, "people")
# SQL statements can be run with the sql function, passing in sqlContext
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
# Store the teenagers DataFrame as a Parquet file
saveAsParquetFile(teenagers, file.path("/teenagers"))
# Stop the SparkContext now
sparkR.stop()
How exactly do I retrieve the data from the cluster into another Spark application? I'm currently considering connecting to the HDFS master and retrieving the files according to this example, except replacing sbt-thrift with scrooge.
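For context: if the consuming application were itself a Spark application with access to the same cluster, I assume the read side would look roughly like this (an untested SparkR 1.x sketch, reading back the /teenagers path written above; the Scala/Java APIs have an equivalent Parquet read):
library(SparkR)
# Re-create the contexts in the consuming application
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
# Read the Parquet files written earlier back into a DataFrame
teenagers <- parquetFile(sqlContext, "/teenagers")
# Inspect the first rows as a local R data.frame
head(teenagers)
sparkR.stop()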
Is there a more idiomatic way to retrieve the data without a direct connection to the Hadoop cluster? I considered copying the data out of HDFS, but from what I understand, Parquet can only be read from Hadoop.