
I am looking at ways to improve the deletion of data in my partitioned Postgres tables without worsening access performance.

Using: Postgres 10.2

I have the table transactions with these columns (omitting some unrelated ones):

transaction_id PK
location
type
user_id
transaction_date

Some important points about the current table:

  1. In production, it has around 100 million rows
  2. The table is partitioned (manually) based on the user_id (modulo 100). This means that the transactions for, say, the user with user_id 3 go to transactions_3, and those for user_id 2356 go to transactions_56.
  3. We insert the records into the right partition ourselves, since Postgres 10 doesn't support this type of partitioning out of the box and since we already know the user for whom the transaction has to be inserted (and likewise upon retrieval); see the sketch after this list.
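
To illustrate, a minimal sketch of the current scheme (simplified; column types are assumptions based on the list above, and the real tables have more columns):

    -- One of the 100 manually created partitions (transactions_0 .. transactions_99)
    CREATE TABLE transactions_3 (
        transaction_id   bigint PRIMARY KEY,
        location         text,
        type             text,
        user_id          bigint NOT NULL,
        transaction_date date   NOT NULL
    );

    -- The application computes user_id % 100 itself and targets that table directly,
    -- both for inserts and for reads
    INSERT INTO transactions_3 (transaction_id, location, type, user_id, transaction_date)
    VALUES (123456, 'OSL', 'purchase', 3, '2021-10-01');

    SELECT * FROM transactions_3 WHERE user_id = 3;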

What works well: Insertions and retrieval. Since we already know the user, we know which table to look at and hence don't have to go through the 100 partitions to find it.

What does not: We have a process that deletes old data quite often, based on the users' subscriptions. This often leads to space issues, as the deleted data doesn't get freed up immediately: plain VACUUM may not be enough when a table contains large numbers of dead row versions as a result of massive update or delete activity (as is our case here).
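
For context, this is what we are limited to today (a sketch; plain VACUUM only marks dead rows as reusable inside the table, while VACUUM FULL returns the space to the operating system but rewrites the table under an exclusive lock):

    -- Makes the space of dead row versions reusable within the table,
    -- but does not shrink the files on disk
    VACUUM transactions_3;

    -- Rewrites the table and gives the disk space back to the operating system,
    -- but blocks all access to the table while it runs
    VACUUM FULL transactions_3;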

How we would like to improve this is to store the data in partitions based on the transaction date, and then be able to drop those tables when the subscription is over. This will make sure that the space is usable again immediately.
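
Roughly what we have in mind, as a sketch (this uses declarative range partitioning, available from Postgres 10 onwards, though later versions handle partitioning much better; the monthly granularity and all names are just examples):

    -- Parent table partitioned by transaction_date
    CREATE TABLE transactions (
        transaction_id   bigint,
        location         text,
        type             text,
        user_id          bigint NOT NULL,
        transaction_date date   NOT NULL
    ) PARTITION BY RANGE (transaction_date);

    CREATE TABLE transactions_2021_10 PARTITION OF transactions
        FOR VALUES FROM ('2021-10-01') TO ('2021-11-01');

    -- Expiring data then becomes a metadata-only operation that
    -- returns the disk space immediately
    DROP TABLE transactions_2021_10;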

In short, our main goal is to improve the deletion process so that the space is recovered immediately, while at the same time making sure that access performance is not worsened.

I have a few questions regarding this:

  1. If we have the table partitioned based on date, I would think this (access at least) is going to be slower, as a lookup now has to scan all the partitions to see where the transaction id is?
  2. Is it really possible to achieve this, keeping the retrieval of transactions as before - while improving the deletion process. If so, how?
  3. I would think having it partitioned both on the date and the account is not really a possible/good solution, due to the large number of tables that would be created? (We need to keep data for a maximum of 2 years.)
  4. For this, do we need to move to a newer Postgres, say Postgres 14 (it being the latest)? I understand that it is always good to upgrade to the latest version, but I wanted to know whether it is really possible to do this without the Postgres upgrade.

Hoping to get some guidance here on the way forward.

Chillax
  • Why would a VACUUM not be sufficient, in your case? – jjanes Oct 13 '21 at 17:08
  • Maybe just stop partitioning. You have given some problems caused by partitioning which you worked around without too much trouble. But you haven't listed any actual benefits. So maybe just stop doing it. You would not need to "go through the 100 partitions" if you didn't have 100 partitions. – jjanes Oct 13 '21 at 17:58
  • @jjanes From the docs: Plain VACUUM may not be satisfactory when a table contains large numbers of dead row versions as a result of massive update or delete activity. We do have massive delete activity. And I would think partitioning is required for larger data sets as in my case (expected to grow further) - and not partitioning could make the performance of retrieving records worse? – Chillax Oct 14 '21 at 07:48
  • The key word in the docs being "may". Why is making space available for internal re-use not sufficient for *you* in *this* situation? I know why it might not be sufficient in other situations, sure. – jjanes Oct 14 '21 at 14:47

1 Answer


First: upgrading PostgreSQL would be a very good idea, not only because hash partitioning was introduced after v10, but also because of the numerous improvements in performance and features for partitioning since v10.

I have the feeling that the partitioning scheme you are using right now (home-grown hash partitioning) does not help you much. You cannot get rid of a customer with a simple DROP TABLE (which would be nice), and deleting 10 million rows in a partition is not any more fun than deleting them in a single large table. On the contrary – the relative bloat once autovacuum is done will be more. The only saving grace is that autovacuum will work more efficiently, since it can treat each partition on its own.

To answer your questions:

  1. Yes, partitioning makes most queries slower; hopefully not much slower. That is the price you are paying.

  2. No, your queries will get somewhat slower (proportional to the number of partitions, so keep that moderate).

  3. You can partition on both criteria, since a partition can again be a partitioned table (see the sketch below). But I question if that is really a good idea, since I doubt that your current partitioning scheme is really beneficial.

  4. Yes, use at least v12, ideally v14.
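
For illustration, here is a sketch of how declarative partitioning on both criteria could look on v12 or later, with a hash partition on user_id that is itself range-partitioned on the date (the number of partitions and all names are just examples, not a recommendation):

    -- Declarative hash partitioning on user_id (available since v11)
    CREATE TABLE transactions (
        transaction_id   bigint,
        user_id          bigint NOT NULL,
        transaction_date date   NOT NULL
    ) PARTITION BY HASH (user_id);

    -- A partition can itself be a partitioned table
    CREATE TABLE transactions_h0 PARTITION OF transactions
        FOR VALUES WITH (MODULUS 4, REMAINDER 0)
        PARTITION BY RANGE (transaction_date);

    CREATE TABLE transactions_h0_2021 PARTITION OF transactions_h0
        FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');

    -- Dropping a year is then cheap, but note that it drops that year
    -- for all users in that hash partition
    DROP TABLE transactions_h0_2021;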

Laurenz Albe
  • Not sure if you have any experience with timescaledb, but do you think it would be an ideal candidate here? As we need to delete the data based on subscription as well as make sure access remains fast enough. – Chillax Oct 13 '21 at 14:17
  • TimescaleDB just uses PostgreSQL partitioning. I don't see how that would make a difference here. – Laurenz Albe Oct 13 '21 at 14:23
  • After I upgrade to 14, I would think (hash) partitioning based on the user_id into, say, 10 partitions would help the access of transactions that have the user_id in the WHERE condition of the query? – Chillax Oct 16 '21 at 22:20
  • I don't see much benefit there, unless your queries perform a sequential scan. An index scan on a large table is not slower than an index scan on a small one. – Laurenz Albe Oct 18 '21 at 05:59