I am wondering whether it is possible to write a SimpleFeature to Cassandra from within a Spark context. I am trying to map my data into an RDD of SimpleFeatures, but I am running into some issues. The createFeature() function being called below works fine in a standalone unit test, and another unit test that calls it successfully writes the resulting SimpleFeature to Cassandra via the GeoMesa API:
import org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator
. . .
private val sparkConf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "localhost")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[GeoMesaSparkKryoRegistrator].getName)
  .setAppName(appName)
  .setMaster(master)
. . .
val rowsRDD = processedRDD.map(r => {
  ...
  println("** NAME VALUE MAP **")
  for ((k, v) <- featureNamesValues) printf("key: %s, value: %s\n", k, v)
  val feature = MyGeoMesaManager.createFeature(featureTypeConfig.asJava, featureNamesValues.asJava)
  feature
})
rowsRDD.collect().foreach(println)
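For context, createFeature() itself just builds the feature with the standard GeoTools builders; a simplified sketch of that kind of helper (the type name, the config shape, and the Class.forName lookup are illustrative, not my exact code) looks like this:
import java.util.{Map => JMap}
import scala.collection.JavaConverters._
import org.geotools.feature.simple.{SimpleFeatureBuilder, SimpleFeatureTypeBuilder}
import org.opengis.feature.simple.SimpleFeature

object MyGeoMesaManager {
  // Sketch only: build a SimpleFeatureType from a name -> class-name config,
  // then populate a SimpleFeature from the name -> value map.
  def createFeature(typeConfig: JMap[String, String],
                    namesValues: JMap[String, AnyRef]): SimpleFeature = {
    val typeBuilder = new SimpleFeatureTypeBuilder()
    typeBuilder.setName("myfeature")
    typeConfig.asScala.foreach { case (name, binding) =>
      typeBuilder.add(name, Class.forName(binding))
    }
    val sft = typeBuilder.buildFeatureType()

    val featureBuilder = new SimpleFeatureBuilder(sft)
    namesValues.asScala.foreach { case (name, value) => featureBuilder.set(name, value) }
    featureBuilder.buildFeature(null) // null id -> GeoTools generates one
  }
}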
However, now that the call happens inside an RDD's map() function in a Spark context, I get a serialization error on SimpleFeatureImpl when Spark has to serialize the features across partitions:
18/02/12 08:00:46 ERROR Executor: Exception in task 0.0 in stage 19.0 (TID 9)
java.io.NotSerializableException: org.geotools.feature.simple.SimpleFeatureImpl
Serialization stack:
- object not serializable (class: org.geotools.feature.simple.SimpleFeatureImpl, value: SimpleFeatureImpl:myfeature=[SimpleFeatureImpl.Attribute: . . ., SimpleFeatureImpl.Attribute: . . .])
- element of array (index: 0)
- array (class [Lorg.opengis.feature.simple.SimpleFeature;, size 4)
So I then added the Kryo dependency mentioned on the GeoMesa Spark Core page to mitigate this, but now I get a NoClassDefFoundError on the GeoMesaSparkKryoRegistrator class when the map function executes, even though, as you can see above, the geomesa-spark-core dependency is on the classpath and I am able to import the class:
18/02/12 08:08:37 ERROR Executor: Exception in task 0.0 in stage 26.0 (TID 11)
java.lang.NoClassDefFoundError: Could not initialize class org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator$
at org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator$$anon$1.write(GeoMesaSparkKryoRegistrator.scala:36)
at org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator$$anon$1.write(GeoMesaSparkKryoRegistrator.scala:32)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
And finally, I tried to add the com.esotericsoftware.kryo dependency to the classpath, but I got the same error.
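For reference, the two dependencies in question look roughly like this in my build (sbt syntax; the versions are placeholders rather than the exact ones I am using):
// build.sbt (sketch; substitute the versions that match your Spark and GeoMesa install)
libraryDependencies ++= Seq(
  "org.locationtech.geomesa" %% "geomesa-spark-core" % "<geomesa-version>",
  "com.esotericsoftware"      % "kryo"               % "<kryo-version>"
)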
Is it going to be possible to do what I am trying to do with GeoMesa, Spark, and Cassandra? It feels like I am on the 1 yard line but I can't quite punch it in.