I am looking at ways to improve deletion of data in my Postgres (partitioned) tables, not worsening the access performance.
Using: Postgres 10.2
Ignoring some unrelated columns, I have the table transactions
with these columns (omitting some unrelated columns):
transaction_id PK
location
type
user_id
transaction_date
Some important points about the current table:
- In production, it has around 100 million rows
- The table is partitioned (manually) based on the
user_id
(modulo 100). This means that the transactions for say user withuser_id
3 will go totransactions_3
and withuser_id
2356 will go totransactions_56
. - We insert the records manually since Postgres(10) doesn't support this type of partitioning out of the box and since we already know the user for which the transaction has to be inserted ( also upon retrieval)
What works well: Insertions and retrieval, since we already know the user - we know which table to look at and hence doesn't have to go through the 100 partitions to find it.
What does not: We have a process which deletes the old data quite often - based on users subscription. But this leads to problems (space issues) often as the data deleted doesn't get freed up immediately. As plain VACUUM may not be enough when a table contains large numbers of dead row versions as a result of massive update or delete activity (as is our case here)
How we would like to improve this is to be able to store the the data in partitions based on the transaction date - and then be able to drop the tables when the subscription is over. This will make sure that the space is usable again immediately.
In short, our main goal is to improve the deletion process so that the space is recovered immediately - at the same time making sure the access performance is not worsened
I have a few questions regarding this:
- If we have the table partitioned based on date, I would think this (access at least) is going to be slower as it now has to scan all the 100 tables to see where the transaction id is?
- Is it really possible to achieve this, keeping the retrieval of transactions as before - while improving the deletion process. If so, how?
- I would think having it partitioned both on the days and account is not really a possible\good solution - due to the large number of tables that could be created? (Need to keep data for a maximum of 2 years)
- For this, do we need to move to a newer Postgres, say Postgres 14 (it being the latest). I understand that it is always good to upgrade to the latest version. But I wanted to know - if it is really possible to do this without the Postgres upgrade.
Hoping to get some guidance here on the way forward.