
I am running a Python script which processes time-series data for a number of different metrics, and then writes the results to a Postgres database.

The time series consists of 40 epochs, with each metric stored as a real[40] array column in the database.

When writing the outputs for all 40 epochs to the table in one shot (a batch update across all rows), everything seemed to work fine, i.e.

UPDATE my_table SET
  arr_col_1 = '{1, 2, 3, ... 40}',
  arr_col_2 = '{1, 2, 3, ... 40}',
  ...
  arr_col_90 = '{1, 2, 3, ... 40}';

However, writing the results epoch by epoch to the corresponding position in each array seems to chew up all free space on the hard drive, e.g.

UPDATE my_table SET
  arr_col_1[1] = 1,
  arr_col_2[1] = 1,
  ...
  arr_col_90[1] = 1;

UPDATE my_table SET
  arr_col_1[2] = 2,
  arr_col_2[2] = 2,
  ...
  arr_col_90[2] = 2;

-- repeat x 38 more times

The reason for the iterative strategy is to accommodate larger numbers of rows, for which the results for all 40 epochs don't fit into memory at the same time.

To my knowledge, UPDATE queries will delete and rewrite row data in certain situations, but I'm not clear on when this happens or how it relates to arrays. Is there a way to iteratively update arrays across large numbers of rows without causing database bloat?

  • Your database design seems wrong to me. Can you share the full table schema and what relationships exist between this table and other tables? Loosely speaking, you should reduce the database schema to 3NF for efficient writing. – Haleemur Ali Mar 11 '19 at 00:03
  • Postgres supports arrays, true. But, it's not a feature that we must *(ab)use* to store data like this, by ignoring more appropriate relational database rules which are fundamental to the design of any conventional schemas/table structures. – Kaushik Nayak Mar 11 '19 at 03:28
  • I don't need to query the epochs individually so is it really better to spread the epochs across columns? Or should I be using JSONB instead? – songololo Mar 11 '19 at 08:38

2 Answers

4

As others have correctly mentioned, this approach is not well suited to PostgreSQL's mode of operation.

However, you may be able to use an optimization called HOT:

  • Declare your table with a fillfactor less than 100 so that INSERTs leave free space in each block:

    ALTER TABLE my_table SET (fillfactor = 50);
    

    This setting only affects future activity; you'd have to reorganize the table for it to take effect for existing data (see the sketch after this list). If you update every row in the table, you may need a setting as low as 30 for it to be effective.

  • Make sure the columns that are updated do not have an index on them.
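
A minimal sketch of that reorganization step, assuming the my_table name and a low fillfactor as suggested above (note that VACUUM FULL rewrites the whole table and holds an exclusive lock while it runs):

ALTER TABLE my_table SET (fillfactor = 30);  -- leave most of each block free for updated row versions
VACUUM FULL my_table;                        -- rewrite existing rows so the free space actually exists in each block

CLUSTER would achieve the rewrite as well; the point is that the table must be rewritten once before existing rows benefit from the new fillfactor.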

Then PostgreSQL can use “HOT updates” (heap-only tuple updates) and reclaim the dead row versions on the fly, which avoids relying on autovacuum, which obviously cannot keep up on your table.

Check the n_tup_hot_upd column in the pg_stat_user_tables row for your table to see if it is working.
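
For example (a sketch, assuming the my_table name from the question), you can compare HOT updates against total updates:

SELECT relname, n_tup_upd, n_tup_hot_upd   -- n_tup_hot_upd should track n_tup_upd closely
FROM pg_stat_user_tables
WHERE relname = 'my_table';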

Laurenz Albe
  • Thank you, I think this is what I need as I don't currently see a way around frequent updates. Ideally I would have looked into a different DB, but need postgis for certain complex spatial queries. – songololo Mar 11 '19 at 11:17
2

Postgres uses MVCC, which does copy-on-write.

The UPDATE copies the whole row to a new row version and marks the old one for deletion, but the deletion itself only takes place during a vacuum, which is run periodically by the autovacuum daemon.
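
As an illustration (a sketch, assuming the my_table name from the question), the dead row versions left behind by such updates show up in the statistics views:

SELECT n_live_tup, n_dead_tup, last_autovacuum   -- a large, growing n_dead_tup is the bloat
FROM pg_stat_user_tables
WHERE relname = 'my_table';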

You can free up the space yourself by running

VACUUM
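
For example (a sketch, again assuming the my_table name from the question):

VACUUM VERBOSE my_table;   -- reclaims dead row versions for reuse and reports what it removed
VACUUM FULL my_table;      -- rewrites the table and returns the space to the operating system (takes an exclusive lock)

Note that a plain VACUUM only marks dead space as reusable inside the table files; VACUUM FULL is what actually shrinks them on disk.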

How much disk space do you have that it runs out? I've never heard of such an issue with a non-gigantic database.

isapir
  • Thank you for the clear explanation. I have been running VACUUM to clear up the bloat. There are several hundred thousand rows, so presumably the iterative updates to the arrays prevent autovacuum from doing its normal job. – songololo Mar 11 '19 at 08:45