
I am trying to test the GeoMesa Cassandra backend.

I have ingested ~2M points from OSM and send DWITHIN and BBOX queries to Cassandra using GeoMesa with GeoTools ECQL.
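
For reference, the queries are issued roughly along these lines; this is just a minimal sketch assuming the standard GeoTools/GeoMesa Java API, with illustrative connection parameters, type name, and geometry attribute name (`geom`), not my exact code:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

import org.geotools.data.DataStore;
import org.geotools.data.DataStoreFinder;
import org.geotools.data.Query;
import org.geotools.data.simple.SimpleFeatureCollection;
import org.geotools.filter.text.ecql.ECQL;

public class OsmQueryExample {
    public static void main(String[] args) throws Exception {
        // Cassandra data store connection parameters (names and values are illustrative)
        Map<String, Serializable> params = new HashMap<>();
        params.put("geomesa.cassandra.contact.point", "localhost:9042");
        params.put("geomesa.cassandra.keyspace", "geomesa");
        params.put("geomesa.cassandra.catalog.table", "osm_poi_a7");

        DataStore ds = DataStoreFinder.getDataStore(params);

        // BBOX query over the indexed geometry attribute (assumed to be called "geom")
        Query bbox = new Query("osm_poi_a7",
                ECQL.toFilter("BBOX(geom, 10.0, 50.0, 10.5, 50.5)"));

        // DWITHIN query: points within 1000 meters of a location
        Query dwithin = new Query("osm_poi_a7",
                ECQL.toFilter("DWITHIN(geom, POINT(10.2 50.2), 1000, meters)"));

        SimpleFeatureCollection bboxResults = ds.getFeatureSource("osm_poi_a7").getFeatures(bbox);
        SimpleFeatureCollection dwithinResults = ds.getFeatureSource("osm_poi_a7").getFeatures(dwithin);
        System.out.println("BBOX matches: " + bboxResults.size());
        System.out.println("DWITHIN matches: " + dwithinResults.size());

        ds.dispose();
    }
}
```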

Then I ran some performance tests, and the results do not look reasonable to me.

Cassandra is installed on a Linux machine with a 16-core Xeon, 32 GB of RAM, and one SSD drive. I got ~150 queries per second.

I started to investigate the GeoMesa execution plan for my queries.

The trace logs coming from org.locationtech.geomesa.index.utils.Explainer were really helpful; they do a great job of explaining what is going on.
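
For anyone who wants to reproduce this, here is a minimal sketch of how the explain output can be enabled, assuming Log4j 1.x is the logging backend (the equivalent `log4j.properties` line would be `log4j.logger.org.locationtech.geomesa.index.utils.Explainer=TRACE`):

```java
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class EnableExplainLogging {
    public static void main(String[] args) {
        // turn on GeoMesa's query-plan explain output before running any queries
        Logger.getLogger("org.locationtech.geomesa.index.utils.Explainer").setLevel(Level.TRACE);
    }
}
```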

What looks confusing to me is the number of range scans that go through Cassandra.

For example, I see the following in my logs: `Table: osm_poi_a7_c_osm_5fpoi_5fa7_attr_v2 Ranges (49): SELECT * FROM ..`

The number 49 is the actual number of range scans sent to Cassandra. Different queries give different results, varying roughly from ~10 to ~130.

10 looks quite reasonable to me, but 130 looks enormous.

Could you please explain what causes GeoMesa to send such a huge number of range scans?

Is there any way to decrease the number of range scans?

Maybe there are some configuration options?

Are there other options, like decreasing the precision of the Z-index, to improve such queries?

Thanks anyway!

d-n-ust

1 Answer


In general, GeoMesa uses common query planning algorithms among its various back-end implementations. The default values are tilted more towards HBase and Accumulo, which support scans with large numbers of ranges. However, there are various knobs you can use to modify the behavior.

You can reduce the number of ranges that are generated at runtime through the system property geomesa.scan.ranges.target (see here). Note that this will be a rough upper limit, so you will generally get more ranges than specified.
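
For example (the value here is only illustrative; the property can also be passed as a JVM flag):

```java
// set before the data store is created, so it is picked up during query planning
System.setProperty("geomesa.scan.ranges.target", "10");
// or on the command line: -Dgeomesa.scan.ranges.target=10
```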

When creating your simple feature type schema, you can also disable sharding, which defaults to 4 shards. The number of ranges generated will be multiplied by the number of shards. See here and here.
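
A minimal sketch of what that looks like at schema-creation time; the user-data key `geomesa.z.splits`, the feature type name, and the attributes are assumptions based on the GeoMesa docs rather than something from your setup:

```java
import org.geotools.data.DataStore;
import org.locationtech.geomesa.utils.interop.SimpleFeatureTypes;
import org.opengis.feature.simple.SimpleFeatureType;

public class CreateSchemaExample {
    // ds is the GeoTools DataStore connected to Cassandra
    static void createSchema(DataStore ds) throws Exception {
        // illustrative feature type; adjust the attributes to match your OSM data
        SimpleFeatureType sft = SimpleFeatureTypes.createType(
                "osm_poi", "name:String,dtg:Date,*geom:Point:srid=4326");

        // use a single shard instead of the default 4, so ranges are not multiplied by 4
        sft.getUserData().put("geomesa.z.splits", "1");

        ds.createSchema(sft);
    }
}
```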

If you are querying multiple 'time bins' (weeks by default), then the number of ranges will be multiplied by the number of time bins you are querying. You can set this to a longer interval when creating your schema; see here.
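
Continuing the schema sketch above, the interval is also set through user data before calling `createSchema`; the key `geomesa.z3.interval` and the value `month` are taken from the GeoMesa docs rather than verified against your version, so treat them as an assumption:

```java
// switch the Z3 index from weekly to monthly time bins, so a query that spans
// several weeks touches fewer bins (sft is the SimpleFeatureType from the sketch above)
sft.getUserData().put("geomesa.z3.interval", "month");
```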

Thanks,

Emilio Lahr-Vivaz
  • Sharding and 'time bins' were quite obvious to me, but not `geomesa.scan.ranges.target`. Thanks for pointing it out! I will try to adjust that and see if it helps. By the way, it seems that `sft.getUserData().put("geomesa.attr.splits", "4");` does not work for Cassandra. At least for my sft, which contains an indexed attribute, I can't see a special `shard` column created. Is it supposed to work another way? – d-n-ust Apr 19 '18 at 23:30
  • No, sorry, you are correct. Attribute shards has not been implemented for Cassandra yet. – Emilio Lahr-Vivaz Apr 20 '18 at 12:22
  • Looks like `geomesa.scan.ranges.target` worked for me. I set it to 1, and now the number of range scans is from 1 to 9 (usually 4-6), which improved the overall QPS rate from ~150 to ~350. – d-n-ust Apr 20 '18 at 13:56