
I am trying to read a CSV file into SparkR (running Spark 2.0.0) and to experiment with the newly added features.

Using RStudio here.

I am getting an error while "reading" the source file.

My code:

Sys.setenv(SPARK_HOME = "C:/spark-2.0.0-bin-hadoop2.6")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", appName = "SparkR")
df <- loadDF("F:/file.csv", "csv", header = "true")

I get an error at the loadDF function.

The error:

loadDF("F:/file.csv", "csv", header = "true")

Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
    at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
    at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
    at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
    at org.apache.spark.sql.hive.HiveSharedSt

Am I missing some specification here? Any pointers on how to proceed would be appreciated.


2 Answers


I have the same problem; even this simple code fails:

createDataFrame(iris)

Maybe something is wrong with the installation?

UPDATE: Yes! I found a solution.

The solution is based on this question: Apache Spark MLlib with DataFrame API gives java.net.URISyntaxException when createDataFrame() or read().csv(...)

For R, just start the session with this code:

sparkR.session(sparkConfig = list(spark.sql.warehouse.dir="/file:C:/temp"))
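For the original question, the same warehouse-directory fix can be combined with Spark 2.0's built-in csv source. A minimal sketch (the warehouse value follows the answer above; adjust it to a writable directory on your machine):

Sys.setenv(SPARK_HOME = "C:/spark-2.0.0-bin-hadoop2.6")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

# Setting spark.sql.warehouse.dir works around the Windows metastore
# path problem described in the linked question.
sparkR.session(master = "local[*]", appName = "SparkR",
               sparkConfig = list(spark.sql.warehouse.dir = "/file:C:/temp"))

# Spark 2.0 ships a native csv data source, so loadDF works directly:
df <- loadDF("F:/file.csv", "csv", header = "true")
head(df)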

Maybe you should try reading the CSV with this library:

https://github.com/databricks/spark-csv

Sys.setenv(SPARK_HOME = "C:/spark-2.0.0-bin-hadoop2.6")

# The --packages argument must be set before the SparkR backend starts.
Sys.setenv(SPARKR_SUBMIT_ARGS = '"--packages" "com.databricks:spark-csv_2.10:1.4.0" "sparkr-shell"')

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

# spark-csv uses the pre-2.0 entry points, so initialize a SparkContext
# and SQLContext rather than a SparkSession.
sc <- sparkR.init(master = "local[*]", appName = "SparkR")
sqlContext <- sparkRSQL.init(sc)

df <- read.df(sqlContext, "cars.csv", source = "com.databricks.spark.csv", inferSchema = "true")
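If the package resolves correctly, you can sanity-check the loaded data (reusing the cars.csv example path from above):

printSchema(df)
head(df)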
  • Hi Erick, thank you for the response. But as I gather, Spark 2.0.0 has native "csv" support, which is why I tried exploring how we could "read" CSV files directly. In addition, Spark 2.0.0 now uses the sparkR.session method for initialization, and sqlContext usage has been deprecated. (On their official webpage, they say one can directly operate on data frames without using sqlContext!) I am kind of lost because I get errors when I try to execute the examples. :( – turnip424 Aug 03 '16 at 04:45
  • spark-csv was indeed merged into Spark 2, with some minor changes. The package should be considered legacy software, as your GitHub link also indicates. – Rick Moritz Apr 30 '17 at 12:02