0

I have a question about HBase Query. I am seeing a lot of data getting scanned for small spatial queries. I fired the geospatial query on the OSMNodes table. Below are query and table details. I am seeing the total number of read requests (5,553,421,708) on HBase and seeing requests on most of the regions and region servers. Any reasoning why did we scan each region (entire table) for this query?

**Query**:

"DWITHIN(geometry, POINT(-122.332426 47.607282), 50, meters) AND ingestionTimestamp <= '2020-05-27 16:59:31' AND nextTimestamp > '2020-05-27 16:59:31'"
**Table Schema:**

geomesa-hbase describe-schema -c atlas -f OSMNodes
INFO  Describing attributes of feature 'OSMNodes'
geometry           | Point     (Spatio-temporally indexed)
ingestionTimestamp | Timestamp (Spatio-temporally indexed)
nextTimestamp      | Timestamp 
serializerVersion  | String    
featurePayload     | String    

User data:
  geomesa.index.dtg    | ingestionTimestamp
  geomesa.indices      | z3:6:3:geometry:ingestionTimestamp,id:4:3:
  geomesa.stats.enable | true
  geomesa.z.splits     | 60
**Query Plan (through GeoMesa Cli):**
geomesa-hbase explain -c atlas -f OSMNodes -q "DWITHIN(geometry, POINT(-122.332426 47.607282), 50, meters) AND ingestionTimestamp <= '2020-05-27 16:59:31' AND nextTimestamp > '2020-05-27 16:59:31'"

Planning 'OSMNodes' (DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00) AND nextTimestamp > 2020-05-27T16:59:31+00:00
  Original filter: (DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND ingestionTimestamp <= '2020-05-27 16:59:31') AND nextTimestamp > '2020-05-27 16:59:31'
  Hints: bin[false] arrow[false] density[false] stats[false] sampling[none]
  Sort: none
  Transforms: none
  Strategy selection:
    Query processing took 17ms for 1 options
    Filter plan: FilterPlan[Z3Index(geometry,ingestionTimestamp)[DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00][nextTimestamp > 2020-05-27T16:59:31+00:00]]
    Strategy selection took 1ms for 1 options
  Strategy 1 of 1: Z3Index(geometry,ingestionTimestamp)
    Strategy filter: Z3Index(geometry,ingestionTimestamp)[DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00][nextTimestamp > 2020-05-27T16:59:31+00:00]
    Geometries: FilterValues(List(POLYGON ((-122.3317610175119 47.607282, -122.33177379496394 47.60715226835226, -122.33181163628976 47.607027522218985, -122.33187308726842 47.606912555524126, -122.33195578637329 47.606811786373285, -122.33205655552413 47.60672908726843, -122.33217152221899 47.60666763628976, -122.33229626835225 47.606629794963936, -122.332426 47.606617017511894, -122.33255573164774 47.606629794963936, -122.33268047778101 47.60666763628976, -122.33279544447586 47.60672908726843, -122.33289621362671 47.606811786373285, -122.33297891273158 47.606912555524126, -122.33304036371024 47.607027522218985, -122.33307820503606 47.60715226835226, -122.3330909824881 47.607282, -122.33307820503606 47.60741173164774, -122.33304036371024 47.60753647778101, -122.33297891273158 47.60765144447587, -122.33289621362671 47.60775221362671, -122.33279544447586 47.60783491273157, -122.33268047778101 47.60789636371024, -122.33255573164774 47.60793420503606, -122.332426 47.6079469824881, -122.33229626835225 47.60793420503606, -122.33217152221899 47.60789636371024, -122.33205655552413 47.60783491273157, -122.33195578637329 47.60775221362671, -122.33187308726842 47.60765144447587, -122.33181163628976 47.60753647778101, -122.33177379496394 47.60741173164774, -122.3317610175119 47.607282))),true,false)
    Intervals: FilterValues(List((-∞,2020-05-27T16:59:31Z]),true,false)
    Plan: ScanPlan
      Tables: atlas_OSMNodes_z3_geometry_ingestionTimestamp_v6
      Ranges (7440): [%00;%0a;E$A%08;%00;%00;%00;%00;%00;::%00;%0a;E$A%0c;], [%01;%0a;E$A%08;%00;%00;%00;%00;%00;::%01;%0a;E$A%0c;], [%02;%0a;E$A%08;%00;%00;%00;%00;%00;::%02;%0a;E$A%0c;], [%03;%0a;E$A%08;%00;%00;%00;%00;%00;::%03;%0a;E$A%0c;], [%04;%0a;E$A%08;%00;%00;%00;%00;%00;::%04;%0a;E$A%0c;]
      Scans (120): ['%0a;ElA%98;%00;%00;%00;%00;%00;::'%0a;Ema%8c;], [:%0a;ElA%98;%00;%00;%00;%00;%00;:::%0a;Ema%8c;], [%14;::%14;%0a;ElA%8c;], [(%0a;ElA%98;%00;%00;%00;%00;%00;::(%0a;Ema%8c;], [%12;%0a;ElA%98;%00;%00;%00;%00;%00;::%12;%0a;Ema%8c;]
      Column families: d
      Remote filters: MultiRowRangeFilter, Z3HBaseFilter[(epoch,2629:2629),(zt,0:2009670),(zxy,335934:1603233:335941:1603248)], CqlFilter[(DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00) AND nextTimestamp > 2020-05-27T16:59:31+00:00]
    Plan creation took 135ms
  Query planning took 433ms

Other Query [1]

While doing experiement when I added lower timestamp it reduced the latency from 2-3 hours to few 10-20 minutes.

geomesa-hbase explain -c atlas -f OSMNodes -q "DWITHIN(geometry, POINT(-122.332426 47.607282), 50, meters) AND ingestionTimestamp <= '2020-05-27 16:59:31' AND ingestionTimestamp >= '2019-05-27 16:59:31' AND nextTimestamp > '2020-05-27 16:59:31'"


Planning 'OSMNodes' ((DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00) AND ingestionTimestamp >= 2019-05-27T16:59:31+00:00) AND nextTimestamp > 2020-05-27T16:59:31+00:00
  Original filter: ((DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND ingestionTimestamp <= '2020-05-27 16:59:31') AND ingestionTimestamp >= '2019-05-27 16:59:31') AND nextTimestamp > '2020-05-27 16:59:31'
  Hints: bin[false] arrow[false] density[false] stats[false] sampling[none]
  Sort: none
  Transforms: none
  Strategy selection:
    Query processing took 24ms for 1 options
    Filter plan: FilterPlan[Z3Index(geometry,ingestionTimestamp)[DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND (ingestionTimestamp >= 2019-05-27T16:59:31+00:00 AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00)][nextTimestamp > 2020-05-27T16:59:31+00:00]]
    Strategy selection took 2ms for 1 options
  Strategy 1 of 1: Z3Index(geometry,ingestionTimestamp)
    Strategy filter: Z3Index(geometry,ingestionTimestamp)[DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND (ingestionTimestamp >= 2019-05-27T16:59:31+00:00 AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00)][nextTimestamp > 2020-05-27T16:59:31+00:00]
    Geometries: FilterValues(List(POLYGON ((-122.3317610175119 47.607282, -122.33177379496394 47.60715226835226, -122.33181163628976 47.607027522218985, -122.33187308726842 47.606912555524126, -122.33195578637329 47.606811786373285, -122.33205655552413 47.60672908726843, -122.33217152221899 47.60666763628976, -122.33229626835225 47.606629794963936, -122.332426 47.606617017511894, -122.33255573164774 47.606629794963936, -122.33268047778101 47.60666763628976, -122.33279544447586 47.60672908726843, -122.33289621362671 47.606811786373285, -122.33297891273158 47.606912555524126, -122.33304036371024 47.607027522218985, -122.33307820503606 47.60715226835226, -122.3330909824881 47.607282, -122.33307820503606 47.60741173164774, -122.33304036371024 47.60753647778101, -122.33297891273158 47.60765144447587, -122.33289621362671 47.60775221362671, -122.33279544447586 47.60783491273157, -122.33268047778101 47.60789636371024, -122.33255573164774 47.60793420503606, -122.332426 47.6079469824881, -122.33229626835225 47.60793420503606, -122.33217152221899 47.60789636371024, -122.33205655552413 47.60783491273157, -122.33195578637329 47.60775221362671, -122.33187308726842 47.60765144447587, -122.33181163628976 47.60753647778101, -122.33177379496394 47.60741173164774, -122.3317610175119 47.607282))),true,false)
    Intervals: FilterValues(List([2019-05-27T16:59:31Z,2020-05-27T16:59:31Z]),true,false)
    Plan: ScanPlan
      Tables: atlas_OSMNodes_z3_geometry_ingestionTimestamp_v6
      Ranges (404100): [%00;%0a;4$A%08;%00;%00;%00;%00;%00;::%00;%0a;4$A%0c;], [%01;%0a;4$A%08;%00;%00;%00;%00;%00;::%01;%0a;4$A%0c;], [%02;%0a;4$A%08;%00;%00;%00;%00;%00;::%02;%0a;4$A%0c;], [%03;%0a;4$A%08;%00;%00;%00;%00;%00;::%03;%0a;4$A%0c;], [%04;%0a;4$A%08;%00;%00;%00;%00;%00;::%04;%0a;4$A%0c;]
      Scans (4080): [2%0a;/la%08;%00;%00;%00;%00;%00;::2%0a;0da%9c;], [%18;%0a;$mA%08;%00;%00;%00;%00;%00;::%18;%0a;%eA%9c;], [%03;%0a;@%e%08;%00;%00;%00;%00;%00;::%03;%0a;@me%9c;], [%0f;%0a;%eE%08;%00;%00;%00;%00;%00;::%0f;%0a;&-E%9c;], ['%0a;Bda%08;%00;%00;%00;%00;%00;::'%0a;C,a%9c;]
      Column families: d
      Remote filters: MultiRowRangeFilter, Z3HBaseFilter[(epoch,2577:2629),(zt,1410483:2097151,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0:2009670),(zxy,335934:1603233:335941:1603248)], CqlFilter[(DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND (ingestionTimestamp >= 2019-05-27T16:59:31+00:00 AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00)) AND nextTimestamp > 2020-05-27T16:59:31+00:00]
    Plan creation took 475ms
  Query planning took 813ms

What is the reasoning behind having this much difference?

Amit
  • 59
  • 6

1 Answers1

1

First, thanks for including the indexing info and the query explainer output. That helps us answer more readily.

When a z3 index is used, if there is only an upper (or similar lower) bound on the date range, then ranges in index space are implicated. For each split, the same pattern of z3 ranges will be scanned, so having 60 splits will lead to lots of ranges to scan, and those ranges will likely be spread over the HBase cluster.

There are a few possible things to try: 1. Reingest with fewer ranges 2. Add a z2 (spatial) index to help with these types of queries (the spatial predicate will return few records which would then be further filtered) 3. Figure out if you can add a lower temporal bound (admittedly, this may not be possible. In some use cases, it does make sense.)

GeoJim
  • 1,320
  • 7
  • 12
  • What if I run OtherQuery[1] instead of above one, where the ingestion timestamp is always ingestionTimestamp >= '2019-05-27 16:59:31' ? Will it improve performance? Query Plan is attached above. – Amit Jun 09 '20 at 18:46
  • ingestionTimestamp <= '2020-05-27 16:59:31' AND ingestionTimestamp >= '2019-05-27 16:59:31' logically means that ingestionTimestamp is exactly '2020-05-27 16:59:31'. If you want to query by exact dates like that, you might consider having an attribute index on ingestionTimestamp. (You may be able to have z2 as a compound 'secondary' index. That'd let you make queries for an exact time and geometry about as fast as you could. – GeoJim Jun 09 '20 at 20:56
  • Other than that, I'd guess that performance of OtherQuery[1] could be faster since it should scan less data. That said, the number of ranges is very high compared to the number of scans. This means that some unnecessary data is likely be scanned. GeoMesa tries to avoid creating too many scans. It joins ranges to achieve this goal. Having 60 splits is likely contributing to slow down in these cases. That said, that's just one variable to experiment with! – GeoJim Jun 09 '20 at 20:59