
I'm new to the geospatial domain and I've managed to add geomesa-spark-jts to the project, which enabled me to use geospatial functions.

I need to go through millions of geocoded events (eventRdd) and, based on a custom criterion, check whether they are within a certain distance of a road segment linestring (roadSegmentRdd).

Currently, for each event I have to scan the entire roadSegmentRdd and check whether the criterion is satisfied, which is far from optimal.

How can I use geomesa and indexes to make this query faster? What are the minimum needed dependencies?

Hedrack

1 Answer


Typically, you would want to ingest at least your point data into a GeoMesa data store, which you could then query based on spatial predicates to efficiently filter down to the ones you are interested in.
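For example, the spatial predicate could be expressed in (E)CQL, which GeoMesa data stores can evaluate against their indices instead of scanning every record. A hedged sketch; the attribute name `geom` and the coordinates are placeholders, not part of any real schema:

```scala
// Hypothetical (E)CQL predicate: features within 100 meters of a point.
// "geom" is a placeholder geometry attribute name for illustration.
val cql = "DWITHIN(geom, POINT(151.21 -33.87), 100, meters)"

// With GeoTools on the classpath, this string would be parsed and applied
// roughly like so (sketch, not runnable as-is):
// val filter = org.geotools.filter.text.ecql.ECQL.toFilter(cql)
// featureSource.getFeatures(filter)
```

The point is that the data store translates such predicates into index scans rather than full passes over the data.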

GeoMesa has several different data store options you could use, from a fully distributed database like HBase to a lightweight file-system-based solution. The best one will depend on your performance requirements and available infrastructure. There is more information about the different data stores here, and Spark-specific details here.
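As a sketch of what connecting to one of those stores looks like, here is a hedged example for the HBase-backed store; the `hbase.catalog` key follows the GeoMesa HBase data store's connection parameters in recent versions, and the catalog table name is a placeholder:

```scala
import scala.collection.JavaConverters._

// Placeholder catalog table name; "hbase.catalog" is the connection
// parameter the GeoMesa HBase data store expects (assumption: a recent
// GeoMesa version -- older releases used a different key).
val params = Map(
  "hbase.catalog" -> "geomesa.events"
)

// With geomesa-hbase-datastore (and its transitive dependencies) on the
// classpath, you would obtain the store roughly like this:
// val ds = org.geotools.data.DataStoreFinder.getDataStore(params.asJava)
```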

Once you have the data ingested, you could try one of the join approaches outlined here or here, depending on the size of your road segment RDD.
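When the road-segment side is small enough to broadcast, one common pattern underlying those join approaches is to build an in-memory spatial index over the segments, broadcast it, and probe it per event. A minimal sketch using plain JTS (which geomesa-spark-jts already depends on); `RoadIndex`, `buildIndex`, and `roadsNear` are illustrative names, not GeoMesa API:

```scala
import org.locationtech.jts.geom.{Envelope, Geometry}
import org.locationtech.jts.index.strtree.STRtree
import scala.collection.JavaConverters._

object RoadIndex {
  // Build an STRtree over the road-segment geometries. The tree becomes
  // immutable once built, so insert everything before the first query.
  def buildIndex(roads: Seq[Geometry]): STRtree = {
    val tree = new STRtree()
    roads.foreach(r => tree.insert(r.getEnvelopeInternal, r))
    tree.build()
    tree
  }

  // Candidate lookup: expand the event's envelope by the search distance,
  // query the index, then confirm candidates with an exact distance test.
  def roadsNear(tree: STRtree, event: Geometry, dist: Double): Seq[Geometry] = {
    val env = new Envelope(event.getEnvelopeInternal)
    env.expandBy(dist)
    tree.query(env).asScala
      .map(_.asInstanceOf[Geometry])
      .filter(r => event.isWithinDistance(r, dist))
      .toSeq
  }
}
```

On Spark you would `sc.broadcast` the built tree and call `roadsNear` inside `eventRdd.mapPartitions`, so each event only touches nearby candidates instead of all 500k segments. Note the distance is in the units of your coordinate system (degrees for plain lon/lat), so you may need to reproject or use a geodesic check.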

Emilio Lahr-Vivaz
  • I'm using EMR, which has HBase support. My `roadSegmentRdd` is not that big, 500k linestrings. Do I need to use HBase/Accumulo as the store, or can I somehow build the indexes in memory? Later I'd add a permanent store like HBase that loads from S3, if possible. I found this [link](http://www.geomesa.org/documentation/user/spark/sparksql.html#in-memory-indexing). Would this require `geomesa-spark-jts`, `geomesa-spark-core` and `geomesa-spark-sql`? – Hedrack May 22 '18 at 00:10
  • The capability to build an index in memory is currently tied to loading data from a GeoMesa datastore. This can probably be refactored, but presently the quickest solution would be to ingest the road segments into S3 and then use the capability at that link to load the data frame up in memory. – GeoJim May 24 '18 at 15:35
  • As an additional note, the in-memory indexing is an alpha feature. Hopefully it'll be useful, but it may require some additional Spark configuration tweaks. – GeoJim May 24 '18 at 15:36