
Is it possible to write a SimpleFeature to Cassandra in a Spark context? I am trying to map SimpleFeatures for my data into a Spark RDD, but I am having some issues. The createFeature() function called below works fine in a standalone unit test, and I have another unit test that calls it and successfully writes the resulting SimpleFeature to Cassandra via the GeoMesa API:

import org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator

. . .

private val sparkConf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "localhost")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[GeoMesaSparkKryoRegistrator].getName)
  .setAppName(appName)
  .setMaster(master)

. . .                                            

val rowsRDD = processedRDD.map(r => {

  ...

  println("** NAME VALUE MAP **")
  for ((k, v) <- featureNamesValues) printf("key: %s, value: %s\n", k, v)

  val feature = MyGeoMesaManager.createFeature(featureTypeConfig.asJava, featureNamesValues.asJava)
  feature
})

rowsRDD.print()

However, now that the function call is inside an RDD's map() in a Spark context, Spark has to ship the resulting features between partitions, and I get a serialization error on SimpleFeatureImpl:

18/02/12 08:00:46 ERROR Executor: Exception in task 0.0 in stage 19.0 (TID 9)
java.io.NotSerializableException: org.geotools.feature.simple.SimpleFeatureImpl
Serialization stack:
- object not serializable (class: org.geotools.feature.simple.SimpleFeatureImpl, value: SimpleFeatureImpl:myfeature=[SimpleFeatureImpl.Attribute: . . ., SimpleFeatureImpl.Attribute: . . .])
- element of array (index: 0)
- array (class [Lorg.opengis.feature.simple.SimpleFeature;, size 4)
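
For reference, one way I think I could sidestep Spark serializing SimpleFeatureImpl entirely would be to build and write the features inside foreachPartition, so they never leave the executor. A rough sketch, where the empty parameter map and the buildNamesValues helper are placeholders for my own setup:

import org.geotools.data.{DataStoreFinder, Transaction}
import scala.collection.JavaConverters._

processedRDD.foreachPartition { rows =>
  // One data store and feature writer per partition, created on the executor.
  // The parameter map is a placeholder for the real Cassandra data store connection params.
  val params = new java.util.HashMap[String, java.io.Serializable]()
  val ds = DataStoreFinder.getDataStore(params)
  val writer = ds.getFeatureWriterAppend("myfeature", Transaction.AUTO_COMMIT)
  try {
    rows.foreach { r =>
      // Build the SimpleFeature on the executor so Spark never has to serialize it;
      // buildNamesValues is a hypothetical helper that maps a row to the name/value map.
      val feature = MyGeoMesaManager.createFeature(featureTypeConfig.asJava, buildNamesValues(r).asJava)
      val toWrite = writer.next()
      toWrite.setAttributes(feature.getAttributes)
      writer.write()
    }
  } finally {
    writer.close()
    ds.dispose()
  }
}

That said, I would still like to get the Kryo registrator working so that I can keep the features in a normal RDD.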

So I then added the Kryo registrator dependency mentioned on the geomesa-spark-core page in an effort to mitigate the serialization error, but now I am getting a NoClassDefFoundError on the GeoMesaSparkKryoRegistrator class when the map function executes, even though the geomesa-spark-core dependency exists on the classpath and I am able to import the class:

18/02/12 08:08:37 ERROR Executor: Exception in task 0.0 in stage 26.0 (TID 11)
java.lang.NoClassDefFoundError: Could not initialize class org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator$
at org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator$$anon$1.write(GeoMesaSparkKryoRegistrator.scala:36)
at org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator$$anon$1.write(GeoMesaSparkKryoRegistrator.scala:32)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
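
Since "Could not initialize class" normally means the class was found but its static initializer had already failed, one thing I could try, as a diagnostic sketch only, is forcing the registrator's companion object to initialize inside a Spark task before the real job runs, to surface the original exception and its cause:

import org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator

// Diagnostic only: make the first initialization of the companion object happen inside
// a task, so the original ExceptionInInitializerError (with its cause) is reported
// instead of the later "Could not initialize class" error.
processedRDD.sparkContext.parallelize(1 to 4).foreach { _ =>
  println(GeoMesaSparkKryoRegistrator.getClass.getName)
}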

And finally, I tried to add the com.esotericsoftware.kryo dependency to the classpath, but I got the same error.

Is it going to be possible to do what I am trying to do with GeoMesa, Spark, and Cassandra? It feels like I am on the 1 yard line but I can't quite punch it in.


1 Answer


The easiest way to set up the classpath is to use Maven with the Maven Shade plugin. Add dependencies on the geomesa-cassandra-datastore and geomesa-spark-geotools modules:

<dependency>
  <groupId>org.locationtech.geomesa</groupId>
  <artifactId>geomesa-cassandra-datastore_2.11</artifactId>
</dependency>
<dependency>
  <groupId>org.locationtech.geomesa</groupId>
  <artifactId>geomesa-spark-geotools_2.11</artifactId>
</dependency>

Then add the Maven Shade plugin, configured similarly to the one used in the GeoMesa Accumulo modules. Submit your Spark job using the shaded jar, and the classpath should have everything required.

– Emilio Lahr-Vivaz
  • Ok yes, I do have those two dependencies on the classpath, although I am not using the shaded jar. Will the inclusion of the correct dependencies be enough to serialize the objects and get them to write correctly to Cassandra via the GeoMesa api, or do I also need the GeoMesaSpark object to get a spatialRDDProvider, etc, as the geomesa-spark-core page mentions? I reviewed the code on github and I do not think there is a SpatialRDDProvider for Cassandra. So this is my biggest concern - is it even *possible* to do what I am trying to do (ie - with *Cassandra* and Spark, not Accumulo and Spark) – user1930364 Feb 12 '18 at 15:09
  • It seems like you were planning to use the regular Cassandra data store from inside Spark, which should work fine. The inclusion of the GeoMesa Spark module is mainly to get the serialization bits. Alternatively, although there isn't an optimized `SpatialRDDProvider` for Cassandra, you can use the generic `GeoToolsSpatialRDDProvider` instead (a rough sketch of that approach follows these comments). I'd still suggest creating a shaded jar to set up the classpath. – Emilio Lahr-Vivaz Feb 12 '18 at 15:22
  • Ok, thanks for the replies and guidance. Yes, that is what I am doing currently, and it does seem to work fine in my unit test that does not include the Spark context. I have done a lot of work to get to this point, and I am just hoping that the serialization/Kryo dependency issues are all that is keeping me from writing to Cassandra. I don't know much about shaded jars but will research them. Thanks again Emilio. I will post back if successful – user1930364 Feb 12 '18 at 15:28
  • It seems like there is something not set in the environment, as the call to GeoMesaSparkKryoRegistratorEndpoint.init() is what is failing on line 43, which is "Option(SparkEnv.get).foreach {", when GeoMesaSparkKryoRegistrator is being loaded. I just do not know why SparkEnv.get is failing. Could there be some other Spark or GeoMesa Kryo system property that needs to be set? I see a reference to "spark.geomesa.kryo.rpc.enable" inside of GeoMesaSparkKryoRegistratorEndpoint. Thanks – user1930364 Feb 12 '18 at 18:09
  • What I am trying to do is run this in a unit test (like a Spark integration test); I am not submitting the job to test it. This is why I am hoping to resolve the dependency problem in the unit test environment, so that I don't have to submit a Spark job multiple times. I have many other tests like this that connect to my local Spark, and I have already written to Cassandra via GeoMesa through one of these tests, so my setup is correct for all of the other tests. It is just the Spark GeoMesa dependencies that seem to be causing issues with this one unit test where I am trying to create a SimpleFeature RDD – user1930364 Feb 12 '18 at 18:15
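
A rough sketch of the generic-provider route mentioned in the comments, modeled on the GeoMesa Spark examples. The "geotools" -> "true" key and the commented-out connection parameters are assumptions, and the exact GeoMesaSpark and save signatures should be checked against the GeoMesa documentation for the version in use:

import org.apache.hadoop.conf.Configuration
import org.geotools.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark

import scala.collection.JavaConversions._

// "geotools" -> "true" asks the SPI to select the generic GeoToolsSpatialRDDProvider;
// the commented entry stands in for the real Cassandra data store connection parameters.
val params = Map(
  "geotools" -> "true"
  // , "<cassandra data store connection parameters>" -> "..."
)

// Reuse the SparkContext that built rowsRDD in the question.
val sc = rowsRDD.sparkContext

// Read an RDD[SimpleFeature] for an existing feature type.
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, new Query("myfeature"))

// Write an RDD[SimpleFeature] (e.g. rowsRDD from the question) back through the data store.
GeoMesaSpark(params).save(rowsRDD, params, "myfeature")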