
I am using GeoMesa Spark on a Databricks cluster, referring to this sample notebook: GeoMesa - NYC Taxis. I had no problem importing and using UDF functions such as st_makePoint and st_intersects. However, when I try to use st_geoHash to create a column of geohashes for the points, I get this error:

NoClassDefFoundError: Could not initialize class org.locationtech.geomesa.spark.jts.util.GeoHash$.

The cluster has geomesa-spark-jts_2.11:3.2.1 and scala-logging_2.11:3.8.0 installed, which are the two libraries given in the notebook (though with a different version of GeoMesa: 2.3.2 in the notebook vs. 3.2.1 on my cluster). I am new to GeoMesa and the Databricks platform, and I wonder if I am missing some dependency needed for the GeoHash class to work.
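
For reference, here is a minimal sketch of what I am running (the DataFrame and column names are placeholders for my actual data, and I am assuming st_geoHash takes a geometry column plus a precision in bits, as in the sample notebook):

import spark.implicits._
import org.locationtech.geomesa.spark.jts._

// register the GeoMesa JTS types and UDFs on this session
spark.withJTS

// st_makePoint and st_intersects work fine
val points = tripsDF.withColumn("pickupPoint", st_makePoint($"pickup_lon", $"pickup_lat"))

// this is the call that fails with NoClassDefFoundError for GeoHash$
val hashed = points.withColumn("pickup_geohash_25", st_geoHash($"pickupPoint", 25))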

  • Did you install the dependencies as a Maven package? – Emilio Lahr-Vivaz Sep 30 '21 at 19:42
  • Also, you could try a bundled jar like the geomesa-gt-spark-runtime_2.11 one? – GeoJim Sep 30 '21 at 20:26
  • @GeoMesaJim Thank you for your suggestion, but how do I install that? I tried installing it directly from Maven, but it gave me an error saying: `RuntimeException: unresolved dependency: org.geotools:gt-process-feature:23.3: not found`. Then I tried to find this geotools package but could not find one with that exact name. I am also curious how to set up dependencies on a cluster in general; I am not very familiar with Linux or virtual environments. – haiqing liu Oct 06 '21 at 18:32
  • @EmilioLahr-Vivaz What do you mean? I searched and installed the packages from the Maven Libraries like AbhishekKhandave-MT pointed out below. – haiqing liu Oct 06 '21 at 19:20
  • You can install the geomesa-gt-spark-runtime_2.11 jar by downloading it from Maven, then installing it with the "upload" option. That jar is a shaded jar that contains all the required dependencies. – Emilio Lahr-Vivaz Oct 07 '21 at 13:13
  • Re: installing the original jar, yes, I was wondering if you installed it as a Maven package (which should pull the transitive dependencies) or just as an uploaded jar. Thank you for clarifying. – Emilio Lahr-Vivaz Oct 07 '21 at 13:14
  • @EmilioLahr-Vivaz Installing geomesa-gt-spark-runtime_2.11 from an uploaded jar worked. Thank you so much. But I still could not use geohash. – haiqing liu Oct 07 '21 at 17:22
  • You could try removing the scala-logging dependency, if you haven't already. We've had to do things like this to work around the Databricks classpath; possibly their environment has changed and created a new conflict: https://github.com/locationtech/geomesa/blob/geomesa-3.3.0/geomesa-gt/geomesa-gt-spark-runtime/pom.xml#L143-L147 – Emilio Lahr-Vivaz Oct 08 '21 at 19:59
  • @EmilioLahr-Vivaz I changed to geomesa-gt-spark-runtime_2.12 and now it works. I did not pay attention to the cluster's Scala version. Thank you so much for your help. – haiqing liu Oct 11 '21 at 19:55

2 Answers


(Updated 18 Oct) I am one of the original contributors to this notebook, along with Derek Yeager, who was the primary author. Complex frameworks like GeoMesa may require special attention, as our Maven UI support on clusters is built for streamlined library installs (more here). The notebook was originally built for Spark 2.4 (DBR 6.x) on a fat jar of GeoMesa that we generated back at that time (late 2019); that jar shaded some dependency conflicts with DBR. Instead of the fat jar, you could use Databricks Container Services, which can be useful for deploying more complex frameworks on our clusters. I should mention that DBR 7+ / Spark 3+ is Scala 2.12 only, so you wouldn't expect Scala 2.11 artifacts to work on those runtimes.

CCRi (the backers of GeoMesa) has generated a Databricks-friendly build. A shaded fat jar for GeoMesa (current version 3.3.0) is available at the Maven coordinates org.locationtech.geomesa:geomesa-gt-spark-runtime_2.12:3.3.0, which is built for Spark runtimes such as Databricks. Since it is shaded, users should add the Maven exclusions jline:*,org.geotools:* (entered in the Databricks library UI without quotes) so that it installs cleanly. I have been able to execute the notebook you referenced (with some small changes) on DBR 9.1 LTS (Spark 3.1).
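
Once installed, a quick smoke test in a Scala notebook cell might look like the sketch below (assuming withJTS registers st_geoHash as a SQL function, which is what the notebook relies on):

import org.locationtech.geomesa.spark.jts._

// registers the JTS UDTs and UDFs (st_makePoint, st_geoHash, ...) on the session
spark.withJTS

// should print a geohash string rather than throwing NoClassDefFoundError
spark.sql("SELECT st_geoHash(st_makePoint(-73.97, 40.78), 25)").show(false)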

  1. One change from the initial notebook is that you no longer need to separately add com.typesafe.scala-logging:scala-logging_2.11:3.8.0.
  2. Another is that you cannot change the Spark configs inside the session, so you would minimally comment these out (a cluster-config equivalent is sketched after the join example below):
// spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// spark.conf.set("spark.kryo.registrator", classOf[GeoMesaSparkKryoRegistrator].getName) 
// spark.conf.set("spark.kryoserializer.buffer.max","500m")
  3. Comment out the skew hint and let Spark 3.x AQE handle the skew:
// short-circuit on geohash and apply the geospatial predicate only when necessary

val joined = trips.as("L")
    // - let AQE handle skew
    //.hint("skew", "pickup_geohash_25", pickup_skews).as("L")
    .join(
      neighborhoodsDF.as("R"),
      ( $"L.pickup_geohash_25" === $"R.geohash" ) && 
      ( st_contains($"R.polygon", $"L.pickupPoint") )
    )
    .drop("geohashes")
    .drop("geohash")
    .drop("polygon")

I have access to all the data from within the Databricks environment in which we produced the notebook, so I am assuming you are somehow reconstituting the data on your side if you are attempting to execute your copy of the notebook.

  • Thank you so much for your detailed explanation. I looked into your notebook because I was working on a large GPS points dataset from INRIX and had performance issues with the spatial join, so I tried to implement the geohash method. The speed does improve a lot now. Although I have used Spark/PySpark for a while, I am pretty new to GeoMesa and know little about configuration and optimization of the Spark environment (especially in a virtual environment). Your help is much appreciated. – haiqing liu Oct 20 '21 at 20:42

I would recommend installing the same version of geomesa-spark-jts_2.11 as given in the notebook.

To install geomesa-spark-jts_2.11:2.3.2, follow the steps below:

Step 1: Click Install Library.

Step 2: Select Maven, then search for and install geomesa-spark-jts_2.11:2.3.2.

Step 3: Alternatively, you can download the jar file and upload it as the Library Source.


Abhishek K
  • Thank you. I tried that but it didn't work. – haiqing liu Oct 06 '21 at 18:27
  • Although the GeoMesa jars are hosted on Maven central, the geotools jars are hosted on a custom repository (https://repo.osgeo.org/repository/release). I'm not sure if you can use multiple repositories when adding a Maven package in Databricks. – Emilio Lahr-Vivaz Oct 07 '21 at 13:17