I'm trying to hook SparkR 1.4.0 up to Elasticsearch using the elasticsearch-hadoop-2.1.0.rc1.jar file (found here). It requires a bit of hacking together, calling the internal SparkR:::callJMethod function, and I need to get a jobj R object for a couple of Java classes. For some of the classes, this works:

SparkR:::callJStatic('java.lang.Class', 
                     'forName', 
                     'org.apache.hadoop.io.NullWritable')
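
The returned value is a jobj handle that I can then pass to SparkR:::callJMethod; for example, calling the standard java.lang.Class method getName on it (a sanity check, just to show the pattern):

cls <- SparkR:::callJStatic('java.lang.Class',
                            'forName',
                            'org.apache.hadoop.io.NullWritable')
SparkR:::callJMethod(cls, 'getName')
# should return "org.apache.hadoop.io.NullWritable"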

But for others, it does not:

SparkR:::callJStatic('java.lang.Class', 
                     'forName', 
                     'org.elasticsearch.hadoop.mr.LinkedMapWritable')

Yielding the error:

java.lang.ClassNotFoundException: org.elasticsearch.hadoop.mr.LinkedMapWritable

It seems like Java isn't finding the org.elasticsearch.* classes, even though I've tried including them both with the command-line --jars argument and with the sparkR.init(sparkJars = ...) function, as shown below.
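For reference, these are the sorts of invocations I tried (the jar path is illustrative, assuming it sits in the working directory):

# from the shell:
#   ./bin/sparkR --jars elasticsearch-hadoop-2.1.0.rc1.jar

# from within R:
sc <- sparkR.init(master = "local[2]",
                  sparkJars = "elasticsearch-hadoop-2.1.0.rc1.jar")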

Any help would be greatly appreciated. Also, if this question more appropriately belongs on the actual SparkR issue tracker, could someone please point me to it? I looked and was not able to find it. And if anyone knows an alternative way to hook SparkR up to Elasticsearch, I'd be happy to hear that as well.

Thanks! Ben

  • There was a bug in the way jars specified with `--jars` were being used in SparkR. We had a fix checked in a couple of days ago at https://github.com/apache/spark/pull/7001. If you build Spark from the master branch you should be able to try this out. – Shivaram Venkataraman Jun 28 '15 at 02:42

1 Answer

Here's how I've achieved it:

# environments, packages, etc ----
Sys.setenv(SPARK_HOME = "/applications/spark-1.4.1")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

library(SparkR)

# connecting Elasticsearch to Spark via ES-Hadoop-2.1 ----
spark_context <- sparkR.init(master = "local[2]",
                             sparkPackages = "org.elasticsearch:elasticsearch-spark_2.10:2.1.0")
spark_sql_context <- sparkRSQL.init(spark_context)
spark_es <- read.df(spark_sql_context, path = "index/type", source = "org.elasticsearch.spark.sql")
printSchema(spark_es)

(Spark 1.4.1, Elasticsearch 1.5.1, ES-Hadoop 2.1 on OS X Yosemite)

The key idea is to link to the ES-Hadoop package via sparkPackages rather than to the jar file, and to read from Elasticsearch through its Spark SQL data source directly.
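From there you can query the Elasticsearch-backed DataFrame with Spark SQL; a minimal sketch (the temporary table name es_table is just illustrative):

registerTempTable(spark_es, "es_table")
results <- sql(spark_sql_context, "SELECT * FROM es_table LIMIT 10")
head(collect(results))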

Alex Ioannides