
I've tried several permutations of the suggestions in How to load csv file into SparkR on RStudio?, but I am only able to get the in-memory-to-Spark solution to work:

Sys.setenv(SPARK_HOME='C:/Users/myuser/apache/spark-1.6.1-bin-hadoop2.6')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"),.libPaths()))

library(SparkR)
sparkR.stop()
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)

df <- read.csv(file="C:/.../file.csv",
               header=TRUE, sep=",",
               na.strings=c("NULL", ""),
               fileEncoding="UTF-8-BOM",
               stringsAsFactors=FALSE)

df <- createDataFrame(sqlContext, df)
df <- dropna(df)
names(df)
summary(df)

The rub with the above is that if file.csv is too large to fit in memory, it causes problems. (A workaround is to load a series of CSV files and rbind them in SparkR; see the sketch below.) Reading the CSV file via read.df would be preferred.
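For reference, that chunked workaround looks roughly like this (the part-file paths are made up, and each chunk still has to pass through local R memory before it reaches Spark, which is exactly the limitation read.df avoids):

# Hedged sketch of the chunked workaround; paths are hypothetical.
chunk_files <- c("C:/data/part1.csv", "C:/data/part2.csv")
chunks <- lapply(chunk_files, function(f) {
  local_df <- read.csv(f, header=TRUE, sep=",",
                       na.strings=c("NULL", ""), stringsAsFactors=FALSE)
  createDataFrame(sqlContext, local_df)   # ship each chunk to Spark
})
df <- Reduce(unionAll, chunks)            # stack the Spark DataFrames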

If I change the init to be:

sc <- sparkR.init(master='local', sparkPackages="com.databricks:spark-csv_2.11:1.2.0")

as suggested in order to use read.df, then no matter what I do, SparkR is hosed.

df <- read.df(sqlContext, "C:/file.csv",
              source="com.databricks.spark.csv",
              header="true", inferSchema="true")

or even

df <- createDataFrame(sqlContext, df)

Pukes:

Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(Unknown Source)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
    at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
    at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:406)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:7

What is the missing pixie-dust for SparkR?

Is there a simpler way to specify or confirm the correct Databricks spark-csv coordinates (2.11:1.2.0)?

Is there a way to load a tab-delimited file or some other format that doesn't require the Databricks package?

P.S. I have noticed that H2O is much more pleasant to integrate with R and doesn't require arcane incantations. The SparkR folks really need to make starting SparkR a one-liner, IMHO...


2 Answers


The following works flawlessly for me:

Sys.setenv(SPARKR_SUBMIT_ARGS='"--packages" "com.databricks:spark-csv_2.11:1.4.0" "sparkr-shell"')
Sys.setenv(SPARK_HOME='/path/to/spark')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

library(SparkR)

sparkR.stop()

sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)

df <- read.df(sqlContext, 
              "/path/to/mtcars.csv", 
              source="com.databricks.spark.csv", 
              inferSchema="true")

I put spark-csv_2.11-1.4.0.jar (the latest jar) into the spark/jars directory, modified the env var accordingly, then ran the rest. collect(df) shows that it works.
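If you want a bit more confirmation than collect(df), a few quick checks along these lines should also work:

printSchema(df)   # confirms inferSchema picked up numeric columns
head(df)          # first few rows, pulled back into a local data.frame
count(df)         # row count computed on the Spark side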

  • When you say 'modified the env var appropriately', do you mean only via the R code above, or is there a Windows environment var that must be set? – Chris Jun 10 '16 at 02:35
  • The code above. The only "external" mod was ensuring the paths to the Spark binary directories (bin & sbin) were in my `PATH`. That should not impact the above, though. – hrbrmstr Jun 10 '16 at 04:26

Pre-built Spark 1.x distributions are built with Scala 2.10, not 2.11. So, if you use such a distribution (which it seems you do), you also need a spark-csv build for Scala 2.10, not for Scala 2.11 (which is the one you use in your code). Change spark-csv_2.11 to spark-csv_2.10 and it should work fine (see also the accepted SO answers here and here).
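Concretely, the init and read from the question would then look something like this (everything else unchanged; the artifact version is the same one used in the question):

# Same init as in the question, but with the Scala-2.10 build of spark-csv,
# which matches the pre-built Spark 1.6.1 download:
sc <- sparkR.init(master="local",
                  sparkPackages="com.databricks:spark-csv_2.10:1.2.0")
sqlContext <- sparkRSQL.init(sc)

df <- read.df(sqlContext, "C:/file.csv",
              source="com.databricks.spark.csv",
              header="true", inferSchema="true")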
