
I am looking for a NoSQL technology that is able to process geospatial as well as time queries on a large scale with decent performance. I want to batch-process several hundred GBs to TBs of data with the proposed NoSQL technology along with Spark. This will obviously run on a cluster with several nodes.

Types of queries I want to run:

  • "normal" queries for attributes like "field <= value"
  • Basic geospatial queries, like querying all data that lies within a bbox.
  • Time queries like "date <= 01.01.2011" or "time >= 11:00 and time <= 14:00"
  • a combination of all three query types, something like "query all data where the location is within a bbox, the date is 01.01.2011, time <= 14:00 and field_x <= 100" (see the sketch after this list)
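
For concreteness, a minimal sketch of what the combined query could look like in MongoDB with pymongo; the collection name "measurements", a 2dsphere index on "location", and the "date"/"field_x" fields are assumptions for illustration, not a fixed schema:

    from datetime import datetime
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    coll = client.mydb.measurements  # db/collection names are assumptions

    # bbox as a GeoJSON polygon (lon/lat pairs, purely illustrative)
    bbox = {
        "type": "Polygon",
        "coordinates": [[
            [10.0, 50.0], [11.0, 50.0], [11.0, 51.0],
            [10.0, 51.0], [10.0, 50.0],
        ]],
    }

    cursor = coll.find({
        "location": {"$geoWithin": {"$geometry": bbox}},  # geospatial part
        # date 01.01.2011 with "time <= 14:00" folded into one upper bound
        "date": {"$gte": datetime(2011, 1, 1),
                 "$lte": datetime(2011, 1, 1, 14, 0)},
        "field_x": {"$lte": 100},                         # attribute filter
    })
    for doc in cursor:
        print(doc)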

I am currently evaluating which technologies fit my use case, but I'm overwhelmed by the sheer number of technologies available. I have thought about popular technologies like MongoDB and Cassandra. Both seem applicable to my use case (Cassandra only with Stratio's Lucene index), but there might be a different technology that works even better.

Is there any technology that will heavily outperform others based on these requirements?

– j9dy
  • Which technology did you decide to go with? – tbsalling Aug 21 '18 at 06:46
  • @tbsalling MongoDB outperformed Cassandra in terms of write performance, because the Stratio Lucene index could not keep up with the ingest speed. It was stressing our nodes too much and we could not get it tuned in an acceptable way at that point in time. Things might have changed; it has been two years. MongoDB was a good choice though :-) – j9dy Aug 21 '18 at 07:55

1 Answer

I want to batch-process several hundred GBs to TBs of data

That's not really a Cassandra use case. Cassandra is optimized first and foremost for write performance; if you have a really huge volume of writes, Cassandra could be a good option for you. Cassandra is not a database for exploratory queries, it is a database for known queries. On the read side, Cassandra is optimized for sequential reads: it can only query data sequentially. It is possible to work around this, but it's not recommended. With the wrong data model, a huge amount of data can become a problem in Cassandra. Maybe a Hadoop-based database system is a better option for you.

Time queries like "date <= 01.01.2011" or "time >= 11:00 and time <= 14:00"

Cassandra is really good for time series data.
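
As an illustration, a minimal time-series sketch with the Python cassandra-driver; the keyspace "logs", the table and all column names are assumptions. Partitioning by sensor and day keeps partitions bounded, and clustering by timestamp stores rows in time order, so a time-range query is one sequential partition slice:

    from datetime import date, datetime
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("logs")  # keyspace is assumed

    session.execute("""
        CREATE TABLE IF NOT EXISTS readings (
            sensor_id text,
            day       date,
            ts        timestamp,
            value     double,
            PRIMARY KEY ((sensor_id, day), ts)
        ) WITH CLUSTERING ORDER BY (ts ASC)
    """)

    # "time >= 11:00 and time <= 14:00" on a given day is one partition slice
    rows = session.execute(
        "SELECT ts, value FROM readings "
        "WHERE sensor_id = %s AND day = %s AND ts >= %s AND ts <= %s",
        ("sensor-42", date(2011, 1, 1),
         datetime(2011, 1, 1, 11, 0), datetime(2011, 1, 1, 14, 0)),
    )
    for row in rows:
        print(row.ts, row.value)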

"normal" queries for attributes like "field <= value"

If you know your queries before you model your database, Cassandra is also a good choice.

a combination of all three query types (something like "query all data where the location is within a bbox, the date is 01.01.2011, time <= 14:00 and field_x <= 100")

Cassandra could be a good solution. Why "could"? As I said: you have to know these queries before you create your tables. If you know that you will have thousands of queries where you need a time range plus the location (city, country, continent, etc.), it is a good solution for you.
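
A hedged sketch of that "known query" modelling style; the area bucket (for example a country code or a coarse geohash) and every name here are assumptions, not part of the answer:

    from datetime import date, datetime
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("logs")

    session.execute("""
        CREATE TABLE IF NOT EXISTS events_by_area_day (
            area    text,      -- e.g. country code or coarse geohash
            day     date,
            ts      timestamp,
            field_x int,
            PRIMARY KEY ((area, day), ts)
        )
    """)

    # The known combined query: one partition, one time slice; field_x is
    # filtered inside that single already-narrowed partition.
    rows = session.execute(
        "SELECT * FROM events_by_area_day "
        "WHERE area = %s AND day = %s AND ts <= %s AND field_x <= %s "
        "ALLOW FILTERING",
        ("DE", date(2011, 1, 1), datetime(2011, 1, 1, 14, 0), 100),
    )

ALLOW FILTERING is tolerable here only because it applies within one partition; a real bbox predicate would still need something like the Stratio Lucene index or client-side filtering.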

time queries on a large scale with decent performance.

Cassandra will have the best performance for this use case, because the data is already stored in the order you need. MongoDB is a nice replacement for MySQL-style use cases: pick it if you need better scaling than MySQL plus flexibility, and you care about consistency. Scaling MongoDB, however, is not as simple as scaling Cassandra. Cassandra offers eventual consistency and is the choice when scalability and performance really matter. MongoDB also has relations (references between documents); Cassandra does not. In Cassandra everything is denormalized, because performance is what counts.
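
To make the denormalization point concrete, a minimal sketch with hypothetical table names: the same reading is written once per known access pattern, instead of being normalized and joined at read time:

    from datetime import datetime
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("logs")

    # one table per known query; both are assumed to exist with these columns
    by_sensor = session.prepare(
        "INSERT INTO readings_by_sensor (sensor_id, day, ts, value) "
        "VALUES (?, ?, ?, ?)")
    by_area = session.prepare(
        "INSERT INTO readings_by_area (area, day, ts, value) "
        "VALUES (?, ?, ?, ?)")

    def save(sensor_id, area, ts, value):
        """Write the same event into both query-specific tables."""
        day = ts.date()
        session.execute(by_sensor, (sensor_id, day, ts, value))
        session.execute(by_area, (area, day, ts, value))

    save("sensor-42", "DE", datetime(2011, 1, 1, 11, 30), 98.6)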

– Citrullin
  • I have thought about a Cassandra column family that contains: sensor_id, timestamp, location (not available in every record!), key, value. Then have a clustering key on my field "key" so I can have multiple key/values for each logical log entry. When querying for a location, I always need to pull out more data, based on the timestamps returned by the geoquery. For example, if my geoquery returns a record with date "25.06.2016-21:18:30", I also want to read the last -5 and +5 minutes. That's where sequential reads might come in really handy. There's a problem I see. [1/2] – j9dy Jun 25 '16 at 19:30
  • Not all my log entries contain the location. So when I query for the location, for example with a "within bbox" query, I might get a single entry that contains a location. This would require me to first run the geoquery, let it complete, and after that take the date/time field of each returned record and read a sequential chunk based on the -5 and +5 minutes around each date returned by the geoquery. Then I'd have the data I really need. I also need to filter on the "key" field, for example "where key = velocity OR key = whatever". Is this a problem? Is there a way to speed this up? – j9dy Jun 25 '16 at 19:31
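
A minimal sketch of the two-step pattern described in the comments above, under one possible layout (partitioned by sensor and day, clustered by timestamp; all names hypothetical, not the asker's final schema): take each timestamp returned by the geoquery and read a -5/+5 minute slice around it, filtering the wanted keys client-side:

    from datetime import timedelta
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("logs")
    WANTED_KEYS = {"velocity", "whatever"}

    def window_reads(sensor_id, geo_hits):
        """geo_hits: the timestamps (datetime) returned by the geoquery."""
        for t in geo_hits:
            # NOTE: a window crossing midnight would need a second read on
            # the neighbouring day partition; omitted for brevity.
            rows = session.execute(
                "SELECT ts, key, value FROM entries "
                "WHERE sensor_id = %s AND day = %s AND ts >= %s AND ts <= %s",
                (sensor_id, t.date(),
                 t - timedelta(minutes=5), t + timedelta(minutes=5)),
            )
            # the "key = velocity OR key = whatever" filter, done client-side
            # to avoid ALLOW FILTERING on the server
            yield [r for r in rows if r.key in WANTED_KEYS]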