(Updated 18 OCT) I am one of the original contributors to this notebook, along with Derek Yeager, who was the primary author. Complex frameworks like GeoMesa may need extra attention, since our cluster UI's Maven support is built for streamlined library installs (more here). The notebook was originally built for Spark 2.4 (DBR 6.x) against a fat jar of GeoMesa that we generated at the time (late 2019); that jar shaded some dependency conflicts with DBR. Instead of the fat jar, you could use Databricks Container Services, which can be useful for deploying more complex frameworks on our clusters. Also note that DBR 7+ / Spark 3+ is Scala 2.12 only, so Scala 2.11 builds will not work on those runtimes.
CCRi (the backers of GeoMesa) has produced a Databricks-friendly build. A shaded fat jar for GeoMesa (currently 3.3.0) is available at the Maven coordinates org.locationtech.geomesa:geomesa-gt-spark-runtime_2.12:3.3.0, built for Spark runtimes such as Databricks. Since it is shaded, you can add the Maven exclusions jline:*,org.geotools:* (entered without quotes in the Databricks library UI) to get it to install cleanly. I have been able to execute the notebook you referenced (with some small changes) on DBR 9.1 LTS (Spark 3.1).
- One change from the initial notebook is that you no longer need to separately add
com.typesafe.scala-logging:scala-logging_2.11:3.8.0
- Another is that you cannot change these Spark configs inside the session, so at minimum comment them out (and, if you want, add them to the cluster config instead; sketch after these lines):
// spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// spark.conf.set("spark.kryo.registrator", classOf[GeoMesaSparkKryoRegistrator].getName)
// spark.conf.set("spark.kryoserializer.buffer.max","500m")
- Comment out the skew hint and let Spark 3.x AQE handle the skew (AQE note after the snippet):
//short circuit on geohash and apply geospatial predicate when necessary
val joined = trips.as("L")
  // let AQE handle the skew instead of the hint
  //.hint("skew", "pickup_geohash_25", pickup_skews).as("L")
  .join(
    neighborhoodsDF.as("R"),
    ($"L.pickup_geohash_25" === $"R.geohash") &&
    (st_contains($"R.polygon", $"L.pickupPoint"))
  )
  .drop("geohashes")
  .drop("geohash")
  .drop("polygon")
I have access to all the data from within the Databricks environment where we produced the notebook, so I am assuming you are somehow reconstituting the data on your side if you are attempting to execute your copy of the notebook.