0

Problem:

I have 2 tables (d1 & d2) containing Geo-spatial points. I want to carry out the following query:

select * from table 1 where table1.point is within 50km of any point in table2.point

I am using Spark-SQL with GeoMesa & Accumulo to achieve the same. (Spark as processing engine, Accumulo as Data Store & GeoMesa for GeoSpatial libraries).

The above query is kind of left semi join but I am not sure on how to achieve it using Spark-SQL because as far as I have read subqueries can't be used in where clause.

kaxil
  • 17,706
  • 2
  • 59
  • 78

1 Answers1

0

Was able to achieve this using:

select * from d1 left semi join d2 on st_contains(st_bufferPoint(d1.point, 10000.0), d2.point)

Spark broadcasted d2 & is carrying out joins but it is still taking more time as the size of d1 is 5 billion & d2 is 10 million.

Not sure though if there is any more efficient way to achieve the same.

kaxil
  • 17,706
  • 2
  • 59
  • 78