5

How do you load a CSV file into SparkR in RStudio? Below are the steps I performed to run SparkR in RStudio. I used read.df to read the .csv file, and I am not sure how else to write this. I am also not sure whether this step counts as creating RDDs.

#Set sys environment variables
Sys.setenv(SPARK_HOME = "C:/Users/Desktop/spark/spark-1.4.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

#Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')

#Load libraries
library(SparkR)
library(magrittr)

sc <- sparkR.init(master="local")
sc <- sparkR.init()
sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3")
sqlContext <- sparkRSQL.init(sc)

data <- read.df(sqlContext, "C:/Users/Desktop/DataSets/hello_world.csv", "com.databricks.spark.csv", header="true")

I am getting this error:

Error in writeJobj(con, object) : invalid jobj 1
zero323
sharp

3 Answers

3

Spark 2.0.0+:

You can use the built-in csv data source:

loadDF(sqlContext, path="some_path", source="csv", header="true")

without loading spark-csv.
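For readers on Spark 2.0 or later, a minimal sketch of the same load using the session-based entry point might look like the following. It assumes SparkR is already on the library path; "some_path" is the placeholder from the line above, and the inferSchema option is an optional extra rather than part of the original answer.

```r
library(SparkR)

# Spark 2.0+ uses a single session instead of sparkR.init() + sparkRSQL.init().
sparkR.session(master = "local[*]")

# Read the CSV with the built-in "csv" source; no spark-csv package is needed.
# "some_path" is a placeholder; inferSchema is optional.
df <- read.df("some_path", source = "csv", header = "true", inferSchema = "true")
head(df)

# Stop the session when finished.
sparkR.session.stop()
```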

Original answer:

As far as I can tell, you're using the wrong version of spark-csv. Pre-built versions of Spark use Scala 2.10, but you're using the spark-csv build for Scala 2.11. Try this instead:

sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.10:1.2.0")
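Put together with the setup from the question, the corrected Spark 1.x workflow could be sketched as below. This assumes SPARK_HOME is already set as in the question and reuses the question's file path. It also calls sparkR.init() only once: the question calls it three times, and as far as I know later calls reuse the existing context, so a sparkPackages argument passed on a later call may be ignored.

```r
# Assumes SPARK_HOME has already been set with Sys.setenv(), as in the question.
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

# Initialise the context once, with the Scala 2.10 build of spark-csv.
sc <- sparkR.init(master = "local",
                  sparkPackages = "com.databricks:spark-csv_2.10:1.2.0")
sqlContext <- sparkRSQL.init(sc)

# Read the CSV through the spark-csv data source.
data <- read.df(sqlContext, "C:/Users/Desktop/DataSets/hello_world.csv",
                source = "com.databricks.spark.csv", header = "true")
head(data)
```

If a context is already running in the R session, restarting R (or calling sparkR.stop()) before running this should ensure the package setting is picked up.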
zero323
  • I tried the new spark-csv from above. Now when I run data <- read.df, I get this error: Error: returnStatus == 0 is not TRUE. – sharp Oct 01 '15 at 00:19
  • Could you provide a full stacktrace? – zero323 Oct 01 '15 at 00:21
  • Are you referring to R console outputs? – sharp Oct 01 '15 at 00:29
  • Yep, there should be much more than `Error: returnStatus == 0` – zero323 Oct 01 '15 at 00:31
  • 'Error: returnStatus == 0 is not TRUE' is the only error I got when I was running 'read.df'. – sharp Oct 01 '15 at 00:36
  • Do you have the same problem when you copy input file to the working directory and use a relative path? – zero323 Oct 01 '15 at 00:43
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/91039/discussion-between-zero323-and-sharp). – zero323 Oct 01 '15 at 00:50
  • sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.10:1.0.3") did it for me (and starting sparkR console like this: sparkR --packages com.databricks:spark-csv_2.10:1.0.3) – ElinaJ Oct 17 '15 at 09:09
1

I solved this issue by providing commons-csv-1.2.jar together with the spark-csv package.

Apparently, spark-csv uses commons-csv but is not packaged with it.

Using the following SPARKR_SUBMIT_ARGS solved the issue (I use --jars rather than --packages).

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--jars" "/usr/lib/spark-1.5.1-bin-hadoop2.6/lib/spark-csv_2.11-1.2.0.jar,/usr/lib/spark-1.5.1-bin-hadoop2.6/lib/commons-csv-1.2.jar" "sparkr-shell"')
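When several jars are involved, the quoted string above is easy to get wrong by hand. A small sketch of building it programmatically follows; build_submit_args is a hypothetical helper name, not part of SparkR, and the jar paths are the ones from this answer.

```r
# Hypothetical helper (not part of SparkR): build the SPARKR_SUBMIT_ARGS value
# from a character vector of jar paths.
build_submit_args <- function(jars) {
  jar_list <- paste(jars, collapse = ",")  # --jars expects a comma-separated list
  sprintf('"--jars" "%s" "sparkr-shell"', jar_list)
}

# Jar locations taken from the answer above; adjust to your installation.
jars <- c(
  "/usr/lib/spark-1.5.1-bin-hadoop2.6/lib/spark-csv_2.11-1.2.0.jar",
  "/usr/lib/spark-1.5.1-bin-hadoop2.6/lib/commons-csv-1.2.jar"
)

# Must be set before sparkR.init() is called.
Sys.setenv(SPARKR_SUBMIT_ARGS = build_submit_args(jars))
```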

In fact, the rather obscure error

Error in writeJobj(con, object) : invalid jobj 1

is clearer when you use the R shell directly instead of RStudio, where it clearly states

java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat

The required commons-csv jar can be found here: https://commons.apache.org/proper/commons-csv/download_csv.cgi

loicmathieu
1

I appreciate everyone's input and solutions! I figured out another way to load a .csv file into SparkR in RStudio. Here it is:

#set sc
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

#load .csv 
patients <- read.csv("C:/...") #Insert your .csv file path

df <- createDataFrame(sqlContext, patients)
df
head(df)
str(df)
sharp
  • Your solution works, but it is not scalable: when your patients data set does not fit into memory, you won't be able to load it in R and convert it to SparkR, but you should still be able to load it directly into SparkR. – Wannes Rosiers Oct 14 '15 at 14:00
  • Good point. I did run into this. However, with the other users' answers, I am getting errors. I am trying to work out how I can load the data directly into SparkR. – sharp Oct 14 '15 at 14:48
  • The Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages"... works fine for me. – Wannes Rosiers Oct 16 '15 at 06:41