I'm new to Spark, SparkR, and HDFS-related technologies in general. I recently installed Spark 1.5.0 and ran some simple SparkR code:
# Point R at the local Spark 1.5.0 installation and load its bundled SparkR
Sys.setenv(SPARK_HOME="/private/tmp/spark-1.5.0-bin-hadoop2.6")
.libPaths("/private/tmp/spark-1.5.0-bin-hadoop2.6/R/lib")
require('SparkR')
require('data.table')

sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
hiveContext <- sparkRHive.init(sc)

# Build a small local table, then convert it to a Spark DataFrame,
# bracketing the conversion with two timestamps
n = 1000
x = data.table(id = 1:n, val = rnorm(n))

Sys.time()
xs <- createDataFrame(sqlContext, x)
Sys.time()
The code executes almost immediately. However, when I change it to n = 1000000, the same conversion takes about 4 minutes (measured as the time between the two Sys.time() calls). Yet when I inspect these jobs in the Spark UI on port 4040, the job for n = 1000 shows a duration of 0.2 s, and the job for n = 1000000 only 0.3 s. Am I doing something wrong?
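
In case it helps pin down where the time goes, here is a minimal sketch of how I tried splitting the measurement between the local-to-Spark conversion and an actual Spark action (system.time() is base R, count() is SparkR's row-count action; it reuses sqlContext from above):

# Sketch: time the conversion and a Spark action separately.
# If most of the 4 minutes lands in the first call, the cost is in
# shipping the data from the R process, not in the Spark job the UI reports.
n = 1000000
x = data.table(id = 1:n, val = rnorm(n))

t_convert <- system.time(xs <- createDataFrame(sqlContext, x))  # R -> Spark
t_action  <- system.time(count(xs))                             # runs a Spark job

print(t_convert)
print(t_action)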