
I've been saving some tick data in TimescaleDB and have been surprised at how much space it's taking up. I'm pretty new to this, but I'm saving roughly 10 million rows a day into a table with the following columns:

(screenshot of the table schema)

This has been taking up roughly 35 GB a day, which seems excessive. I was wondering what steps I can take to reduce this. If I changed the double precision columns to float (real), would it have a big impact? Are there any other ways to reduce the size?
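For reference, the change I'm considering would be something like the following (table and column names are illustrative; from what I understand, `real` is 4 bytes versus 8 for `double precision`):

```sql
-- Illustrative names: "ticks" stands for my hypertable, "price"/"quantity" for the double columns.
-- Changing double precision (8 bytes) to real (4 bytes) rewrites existing data,
-- so I'd test this on a copy of the table first.
ALTER TABLE ticks
    ALTER COLUMN price    TYPE real,
    ALTER COLUMN quantity TYPE real;
```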

EDIT:

The results of running chunk_relation_size_pretty() are:

(screenshot of the chunk_relation_size_pretty() output)

and hypertable_relation_size_pretty():

(screenshot of the hypertable_relation_size_pretty() output)
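For completeness, I gathered those numbers with roughly the following queries (the hypertable name is illustrative):

```sql
-- Per-chunk sizes (table, index, toast, total):
SELECT * FROM chunk_relation_size_pretty('ticks');

-- Totals for the whole hypertable:
SELECT * FROM hypertable_relation_size_pretty('ticks');
```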

It also seems very strange that the index is taking up so much space. I tried querying over a certain time range and the results took quite a while to come back (roughly 10 minutes for a day's worth of data). The index is currently a composite index on (instrument, exchange, time DESC).
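The index and a typical slow query look roughly like this (the instrument and exchange values are made up):

```sql
-- Composite index as described above:
CREATE INDEX ON ticks (instrument, exchange, time DESC);

-- Example of the kind of range query that takes ~10 minutes
-- for a day's worth of data:
SELECT *
FROM ticks
WHERE instrument = 'EURUSD'
  AND exchange   = 'some_exchange'
  AND time >= '2020-06-01'
  AND time <  '2020-06-02';
```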

HHH
  • We need more data: what is the table definition? What does `pgstattuple` say about the table? – Laurenz Albe Jun 04 '20 at 06:20
  • Can you post table schema, indexes and actual sizes, which can be obtained by [`hypertable_relation_size_pretty()`](https://docs.timescale.com/latest/api#hypertable_relation_size_pretty) and [`chunk_relation_size_pretty()`](https://docs.timescale.com/latest/api#chunk_relation_size_pretty) – k_rus Jun 04 '20 at 06:34
  • @k_rus sure - I've added more info in the question. – HHH Jun 07 '20 at 22:59
  • I feel something more is going on in the database, but I haven't figured out what to look for. Can you run `VACUUM ANALYZE`? Do you know if it was run? – k_rus Jun 08 '20 at 07:20
  • It is weird to see that price and quantity are stored as double. It seems like a direct translation from JavaScript. I would expect quantity to be an integer and price to be a decimal or numeric. – k_rus Jun 08 '20 at 07:22
  • Did you come up with anything, apart from compression? I'm having the same "issue". – Benjen Oct 06 '20 at 05:40
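A minimal sketch of the diagnostics suggested in the comments above, assuming the hypertable is named `ticks` (the chunk name below is a placeholder; real chunk names appear in the `chunk_relation_size_pretty()` output):

```sql
-- Refresh planner statistics and clean up dead tuples:
VACUUM ANALYZE ticks;

-- pgstattuple reports live vs. dead tuple space and free space (bloat).
-- On a hypertable the rows live in the chunks, so point it at a chunk
-- rather than at the parent table:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('_timescaledb_internal._hyper_1_1_chunk');
```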

2 Answers


You should turn on TimescaleDB's native compression:

https://docs.timescale.com/latest/using-timescaledb/compression
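A minimal sketch of what enabling it can look like, assuming a hypertable named `ticks` with the columns mentioned in the question (the segmentby/orderby choices and the 7-day policy interval are illustrative):

```sql
-- Enable native compression; chunks older than the policy interval
-- are converted into compressed, columnar-style storage.
ALTER TABLE ticks SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'instrument, exchange',
    timescaledb.compress_orderby   = 'time DESC'
);

-- Automatically compress chunks older than 7 days (TimescaleDB 2.x):
SELECT add_compression_policy('ticks', INTERVAL '7 days');
-- On TimescaleDB 1.x the equivalent call was:
-- SELECT add_compress_chunks_policy('ticks', INTERVAL '7 days');
```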

Mike Freedman
  • Thanks - I'll test this. It seems like this will help, but I'm not sure it's the main problem I'm having. Are TimescaleDB tables normally expected to take up much more space, then (given this is turned off by default)? – HHH Jun 05 '20 at 01:59
  • TimescaleDB's basic (uncompressed) table structure is essentially identical to Postgres's, so that's the baseline overhead for your row structure. If you do the math, each row takes up roughly the width of each column, plus roughly 27 bytes of additional metadata (e.g. MVCC versioning). `create_hypertable` also creates a btree index on the timestamp by default; you should double-check that you don't have both the default index and your composite one. But it's not surprising that the composite index is large -- you probably have a very large number of distinct instrument/exchange/timestamp combinations. – Mike Freedman Jun 05 '20 at 04:51
  • @MikeFreedman I was experimenting with TimescaleDB for time-series data, using daily stock data, ~28M rows in total. Postgres took 1.8 GB of space; the TimescaleDB table created with create_hypertable() took 2.2 GB (per hypertable_size('table_name')), which is much higher than plain Postgres. What's even more surprising is that when I turned compression on with segment_by on ticker, the size blew up to 4.0 GB. chunk_compress_stats() shows every chunk is roughly 80% larger. – David Mar 06 '21 at 01:17
  • @MikeFreedman details are captured here https://medium.com/p/68405425827 – David Mar 06 '21 at 02:22
  • @David - Left a comment on Medium: If I understand what's happening, your chunks are likely way too small (or your data too sparse). The "default" time interval per chunk is one week. If you are only taking 1 datapoint per day per stock, that means that each "segmentby" group is 7 items, so you are probably getting more overhead from the various array types we're using as part of compression, versus the compressibility itself. We typically recommend at least 100s of rows per distinct segmentby item per chunk. The additional overhead from TimescaleDB vs. Postgres is likely that as well. – Mike Freedman Mar 06 '21 at 05:07
  • One simple way to test: `SELECT count(*) FROM timescaledb_information.chunks WHERE hypertable_name = 'stock_price_hyper';` Please see here for best practices: https://docs.timescale.com/latest/using-timescaledb/compression#how-it-works https://docs.timescale.com/latest/using-timescaledb/hypertables#best-practices – Mike Freedman Mar 06 '21 at 05:07
  • Thanks, I have updated the post and corrected it. Thanks again for your prompt and expert advice. BTW, can this be automatically detected and flagged to the user? – David Mar 06 '21 at 06:01
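Following on from the chunk-sizing discussion in the comments above, a minimal sketch (the hypertable name is taken from the comment; the 1-year interval is only an illustration of "make chunks large enough that each segmentby group holds hundreds of rows"):

```sql
-- How many chunks does the hypertable have? (TimescaleDB 2.x view)
SELECT count(*)
FROM timescaledb_information.chunks
WHERE hypertable_name = 'stock_price_hyper';

-- Widen the chunk interval for sparse (one row per day per ticker) data.
-- Note: this only affects chunks created after the call.
SELECT set_chunk_time_interval('stock_price_hyper', INTERVAL '1 year');
```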

Try storing the data in other time series databases such as InfluxDB or VictoriaMetrics (I'm the core developer of VictoriaMetrics). They may provide better on-disk compression than TimescaleDB according to benchmarks.

valyala