I'm trying to hook SparkR 1.4.0 up to Elasticsearch using the elasticsearch-hadoop-2.1.0.rc1.jar file (found here). It requires a bit of hacking together, calling the internal SparkR:::callJMethod function, and I need to get a jobj R object for a couple of Java classes. For some of the classes, this works:

SparkR:::callJStatic('java.lang.Class', 
                     'forName', 
                     'org.apache.hadoop.io.NullWritable')
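
The returned value is a jobj handle that I can then pass to SparkR:::callJMethod; for example, calling the standard java.lang.Class method getName on it (a sanity check, just to show the pattern):

cls <- SparkR:::callJStatic('java.lang.Class',
                            'forName',
                            'org.apache.hadoop.io.NullWritable')
SparkR:::callJMethod(cls, 'getName')
# should return "org.apache.hadoop.io.NullWritable"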

But for others, it does not:

SparkR:::callJStatic('java.lang.Class', 
                     'forName', 
                     'org.elasticsearch.hadoop.mr.LinkedMapWritable')

Yielding the error:

java.lang.ClassNotFoundException: org.elasticsearch.hadoop.mr.LinkedMapWritable

It seems like Java isn't finding the org.elasticsearch.* classes, even though I've tried including them both with the command-line --jars argument and with the sparkR.init(sparkJars = ...) function, as shown below.
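For reference, these are the sorts of invocations I tried (the jar path is illustrative, assuming it sits in the working directory):

# from the shell:
#   ./bin/sparkR --jars elasticsearch-hadoop-2.1.0.rc1.jar

# from within R:
sc <- sparkR.init(master = "local[2]",
                  sparkJars = "elasticsearch-hadoop-2.1.0.rc1.jar")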

Any help would be greatly appreciated. Also, if this question more appropriately belongs on the actual SparkR issue tracker, could someone please point me to it? I looked and was not able to find it. And if anyone knows an alternative way to hook SparkR up to Elasticsearch, I'd be happy to hear that as well.

Thanks! Ben

  • There was a bug in the way jars specified with `--jars` were being used in SparkR. We had a fix checked in a couple of days ago at https://github.com/apache/spark/pull/7001. If you build Spark from the master branch you should be able to try this out. – Shivaram Venkataraman Jun 28 '15 at 02:42

1 Answer

Here's how I've achieved it:

# environments, packages, etc ----
Sys.setenv(SPARK_HOME = "/applications/spark-1.4.1")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

library(SparkR)

# connecting Elasticsearch to Spark via ES-Hadoop-2.1 ----
spark_context <- sparkR.init(master = "local[2]",
                             sparkPackages = "org.elasticsearch:elasticsearch-spark_2.10:2.1.0")
spark_sql_context <- sparkRSQL.init(spark_context)
spark_es <- read.df(spark_sql_context, path = "index/type", source = "org.elasticsearch.spark.sql")
printSchema(spark_es)

(Spark 1.4.1, Elasticsearch 1.5.1, ES-Hadoop 2.1 on OS X Yosemite)

The key idea is to link to the ES-Hadoop package via sparkPackages rather than to the jar file, and to read from Elasticsearch through its Spark SQL data source directly.
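From there you can query the Elasticsearch-backed DataFrame with Spark SQL; a minimal sketch (the temporary table name es_table is just illustrative):

registerTempTable(spark_es, "es_table")
results <- sql(spark_sql_context, "SELECT * FROM es_table LIMIT 10")
head(collect(results))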

Alex Ioannides