3

I need to store a large number of structured records (potentially hundreds of billions) in a database. Data would be written continuously by many sensors, with a high insert rate (up to 100k rows/second).

Since the data is well structured, it seems like a good match for a relational database such as Postgres. However, I am afraid the performance wouldn't be sufficient for the volume of data that needs to be ingested.

Furthermore, I do not need all of the features of a relational database (no need for full SQL support). Data would be written once, and read a few times as large chunks using basic queries such as:

SELECT time, value FROM data WHERE time > 1000 AND time < 2500 AND sensor_location = 'home' ORDER BY time

That is, select all records between two timestamps for a given sensor (or set of sensors). I do not need complex queries such as joins, nor updates. The ORDER BY clause is important, since I need to process these records in the order they were written (using a Python script). These queries typically return many rows, and the results are often too large to fit in RAM. Furthermore, returning that many rows is very slow with most RDBMSs due to their text-based wire protocols, even if I split the query.

This seems to be a good use case for a time series database such as InfluxDB. However, its open source version cannot easily be distributed (which is a requirement in my case, both for resilience and scalability), and my tests showed that it is not performant enough for large queries (in particular, its wire protocol is too slow to transfer this many rows efficiently, and it sometimes even crashes when a query returns too many rows).

I recently learned about ClickHouse, which is horizontally scalable and highly performant. It has a binary, compressed wire protocol, and one of the Python drivers (clickhouse_driver) has an execute_iter function which avoids blowing up the client's RAM when making these large queries. However, I am quite worried about its resilience (data corruption is not tolerable in my use case), since it is fairly recent and has a limited user base.
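For reference, the kind of streaming read I have in mind with clickhouse_driver would look roughly like this (just a sketch; the host is a placeholder and the table/column names mirror the query above):

```python
from clickhouse_driver import Client

# Placeholder host; the table and column names mirror the query above.
client = Client("localhost")

rows = client.execute_iter(
    "SELECT time, value FROM data "
    "WHERE time > %(start)s AND time < %(end)s AND sensor_location = %(loc)s "
    "ORDER BY time",
    {"start": 1000, "end": 2500, "loc": "home"},
    settings={"max_block_size": 100000},
)

# Rows are streamed block by block, so the full result set never has to
# fit in the client's RAM.
for time_, value in rows:
    print(time_, value)  # replace with the real processing
```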

I am aware that my use case is quite specific. Are there other free/open source options I should be aware of?

Ben
  • 1
    Have you considered MongoDB: https://www.mongodb.com/try/download/community? NoSQL databases may be what you need, since you don't need to model any complex relationships. – Dr. Mantis Tobbogan Jul 11 '20 at 20:11
  • Interesting. I have tried a few things with MongoDB. I also tried Arctic (https://arctic.readthedocs.io/en/latest/), which uses MongoDB as its backend. However, I feel that the document structure of MongoDB does not really fit my use case. The most natural solution would be to create one document for each sensor measurement, but this leads to very low performance. Arctic's model groups many measurements in a single document, but I feel like this is a stretch. – Ben Jul 11 '20 at 20:19
  • 2
    You can use the Timescale extension for PostgreSQL: https://www.timescale.com/ – Slava Rozhnev Jul 11 '20 at 20:21
  • 1
    I was just about to suggest Timescale as well, also Redis is worth a look if you have time. – Dr. Mantis Tobbogan Jul 11 '20 at 20:24
  • @SlavaRozhnev thanks for the suggestion! I'll look into it. Does Timescale have a fast wire protocol? – Ben Jul 11 '20 at 20:25
  • Isn't Redis meant for temporary storage/cache and not persistence? – Ben Jul 11 '20 at 20:29
  • 1
    Redis is not designed for large time-series storage. Look at Wikipedia for a list of popular time-series databases. – Slava Rozhnev Jul 11 '20 at 20:32
  • 3
    CH has a replication feature and stores 4 different checksums (in case of data corruption or total loss, the data will be fetched from a replica). I have been managing CH clusters with petabytes of data for 3 years now. No data lost so far. You can use CH replication with several replicas, e.g. a replication factor of 3 or 4. Also check https://github.com/VictoriaMetrics/VictoriaMetrics – Denny Crane Jul 11 '20 at 20:33
  • 100k rows/sec sounds very challenging. I wonder if you could achieve this by using `COPY ... FROM STDIN` in PostgreSQL (rough sketch below). It could be that you have to be very careful with your timestamp field: you don't want your database spending time converting strings to datetimes. As a fallback (and depending on your timing constraints), you might be able to stream to a CSV file and import it into the database later. – bfris Jul 12 '20 at 20:46
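A rough sketch of that `COPY ... FROM STDIN` idea with psycopg2 (the connection parameters are assumptions; the table and column names follow the query in the question):

```python
import io
import psycopg2

# Assumed connection parameters; adapt to the real database.
conn = psycopg2.connect("dbname=sensors user=postgres")
cur = conn.cursor()

# Batch rows into an in-memory CSV buffer. Sending epoch integers avoids
# spending client time converting strings to datetimes.
rows = [(1000, "home", 21.5), (1001, "home", 21.6)]
buf = io.StringIO("".join(f"{t},{loc},{v}\n" for t, loc, v in rows))

# Stream the whole buffer to the server in one COPY command.
cur.copy_expert(
    "COPY data (time, sensor_location, value) FROM STDIN WITH (FORMAT csv)",
    buf,
)
conn.commit()
```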

3 Answers

2

Looks like your case is typical for ClickHouse. Please use the ReplicatedMergeTree table engine: https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/replication/
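A minimal sketch of what such a table could look like for the schema in the question, created through clickhouse_driver (the ZooKeeper path, the {shard}/{replica} macros and the column types are assumptions to adapt to your cluster):

```python
from clickhouse_driver import Client

client = Client("localhost")

# Illustrative DDL only: the ZooKeeper path, the {shard}/{replica} macros
# and the column types are assumptions, not a drop-in schema.
client.execute("""
    CREATE TABLE IF NOT EXISTS sensor_data
    (
        time            DateTime,
        sensor_location LowCardinality(String),
        value           Float64
    )
    ENGINE = ReplicatedMergeTree(
        '/clickhouse/tables/{shard}/sensor_data',
        '{replica}'
    )
    PARTITION BY toYYYYMM(time)
    ORDER BY (sensor_location, time)
""")
```

The ORDER BY key stores rows sorted by sensor and time, which matches the range queries in the question, and the replicas cover the resilience requirement.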

Slach
0

Take a look at the VictoriaMetrics time series database. It easily handles an ingestion rate of 100k rows/sec on a single node with a few CPU cores. It is optimized for storing and querying trillions (10^12) of rows - see the case studies. It also scales to multiple nodes - see the docs for the cluster version.

It also provides the MetricsQL query language, which is optimized for typical time series queries in production. For instance, the following query would return the time series for all the temperature sensors at home: `temperature{sensor_location="home"}`.
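For example, a rough sketch of writing and reading through the single-node HTTP API with plain `requests` (the address, metric name and sample are illustrative, and the snippet assumes the single-node /api/v1/import/prometheus and /api/v1/export endpoints):

```python
import requests

VM = "http://localhost:8428"  # assumed single-node VictoriaMetrics address

# Push one sample in Prometheus exposition format; the trailing number is
# an optional timestamp in milliseconds.
requests.post(
    f"{VM}/api/v1/import/prometheus",
    data='temperature{sensor_location="home"} 21.5 1500000\n',
)

# Export the raw samples matching the selector from this answer over a
# time range (start/end are Unix timestamps in seconds).
resp = requests.get(
    f"{VM}/api/v1/export",
    params={
        "match[]": 'temperature{sensor_location="home"}',
        "start": 1000,
        "end": 2500,
    },
    stream=True,
)
for line in resp.iter_lines():
    print(line)  # one JSON-encoded time series per line
```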

valyala
0

You should be aware of Warp 10. It is scalable and looks like a good fit for your use case.

Since you process the messages using Python, the fact that it is well integrated with Python should be relevant to you. It supports both Pickle and Arrow for wiring data to Python. You can also distribute the processing using its integration with Spark.

user2682877