0

I have a Postgres database containing many tables, each containing many rows, with many relations of all types between these different tables.

If I add a new column should_delete to all of these tables, is there some automatic way I can have all rows WHERE should_delete = true deleted, without having to manually specify each table that rows should be deleted from, or manually order queries so that rows are deleted before other rows referencing them (and so avoid errors due to constraints failing, e.g. by deleting a row referenced by a foreign key before the row containing the foreign key, even if both rows contain should_delete = true and so should be deleted)?

This should ultimately result in all rows marked should_delete = true being deleted, and no rows marked should_delete = false being deleted, but should only result in an error if there's no way to delete only rows marked should_delete = true without violating a constraint.

If there's no pure Postgres way to achieve this, I am also using Python and Django to access this database, and so it would also be acceptable if there is a Python package that could be used to do this.

Bob Whitelock
  • 167
  • 3
  • 12
  • If you have ```FOREIGN KEY``` relationships between the tables why not use ```ON DELETE CASCADE```? Then you delete from the parent and the child records are deleted. – Adrian Klaver Jul 31 '20 at 17:35
  • @AdrianKlaver A few reasons I'm cautious of doing that: I don't want to necessarily delete every record referencing the parent, only records with `should_delete = true`; I don't necessarily know what the parent to start from will be in each case, it could be some arbitrary set of rows referencing each other in different tables; and this behaviour is a special case I want to do for a single feature/within a single transaction, I would find it quite risky to have all my data delete on cascade in all other situations if some top-level parent record was to be deleted. – Bob Whitelock Jul 31 '20 at 17:48
  • Honestly, I think this whole thing is a recipe for disaster and a sign of poor planning. You run the very great risk of forgetting what records have ```should_delete = true``` and removing records you don't want to when conditions change. I favor the explicit approach of deleting the records you want by selecting them in each query instead of relying on some information that you may not remember was set. – Adrian Klaver Jul 31 '20 at 18:02
  • @AdrianKlaver is right, but lets dig a bit into this problem :-) This means if you have TableA depending on TableB but only in TableA record has should_delete = True that means that you should not delete this record since record in TableB has should_delete is False (or empty or whatever) - is that what you try to achieve? – jana Jul 31 '20 at 18:16
  • Thanks for your thoughts @AdrianKlaver - I think without knowing the full details of what constraints there are, how much the data matters etc. it's hard to say how good a particular database design is. This is an intentionally minimal description of what I'm trying to do just to see whether there's some easy way to achieve this, and I'm considering other options too. – Bob Whitelock Jul 31 '20 at 23:18
  • To expand more, I basically have some data that is important and some that is transient; both are in the same format in all other respects and will be tagged with `should_delete` appropriately when created, after which this column will never change. It would be useful if I could just store all this data in one database and delete the transient data automatically periodically based on this flag, while not touching the rest of the data. I could use two databases for the two sets of data but this would make things a lot more complicated within my application, and it would be useful for ... – Bob Whitelock Jul 31 '20 at 23:21
  • ... transient data to be able to reference the non-transient data. @jana that sounds correct - everything tagged with `should_delete = true` is fine to delete, but nothing with `should_delete = false` should be deleted, and doing so would indicate an application bug. If it's not possible to delete something tagged `should_delete = true` due to relations with constraints with things tagged `should_delete = false`, this would also indicate a bug which should be surfaced and addressed. – Bob Whitelock Jul 31 '20 at 23:21
  • Why not put them in two tables each in a separate schema? You could combine the tables with a view for an overall look at the data. – Adrian Klaver Aug 01 '20 at 00:35
  • @AdrianKlaver There's hundreds of tables, and I want to have some transient and some non-transient rows in each of them – Bob Whitelock Aug 01 '20 at 00:50

1 Answers1

0

Try following select to get list of tables which contain your column

SELECT table_name
FROM information_schema.columns
WHERE column_name = 'should_delete'
ORDER BY table_schema, table_name;

loop over the result and proceed with your delete statement for each table

I took inspiration from this post: https://kb.objectrocket.com/postgresql/postgres-list-tables-with-python-1023

The second possibility is to generate SQL file with actual delete statements when you get the result from the select - something like this:

with open("delete_statements.sql", "w") as myFile:
    for table in resultFromPostgresqlSelect:
        myFile.write("DELTE FROM {} WHERE should_delete = true;\n".format(table))

EDIT - after few comments and clarifications

What you want is to connect to PostgreSQL using python and run following query

SELECT
    tc.table_schema,
    tc.constraint_name,
    tc.table_name,
    kcu.column_name,
    ccu.table_schema AS foreign_table_schema,
    ccu.table_name AS foreign_table_name,
    ccu.column_name AS foreign_column_name
FROM
    information_schema.table_constraints AS tc
    JOIN information_schema.key_column_usage AS kcu
      ON tc.constraint_name = kcu.constraint_name
      AND tc.table_schema = kcu.table_schema
    JOIN information_schema.constraint_column_usage AS ccu
      ON ccu.constraint_name = tc.constraint_name
      AND ccu.table_schema = tc.table_schema
WHERE constraint_type = 'FOREIGN KEY';

It will give you following list of foreign keys and list of tables where they are - and what is more important - column names which references to those tables

 table_schema | constraint_name | table_name |     column_name      | foreign_table_schema | foreign_table_name | foreign_column_name
--------------+-----------------+------------+----------------------+----------------------+--------------------+---------------------
 public       | fk_table_a_id   | table_b    | reference_table_a_id | public               | table_a            | id

From this you go to through each foreign_table (in this case table_a) and do select of all IDs where should_delete=true and then do the same for your referenced tables to this table. Once you have those lists (in my case two lists - one list of IDs from table_a and one list of IDs from table_b) you will join them and find only those which are on both sides - those rows you can delete from both tables

In my case i would generate those two SQLs and get result for them

SELECT id FROM table_a WHERE should_delete = true;

SELECT reference_table_a_id FROM table_b WHERE should_delete = true;

Then i can match those IDs and i have list for deletion from table_a and table_b

Or even better - you generate delete query which will have joins between tables - see following post for details PostgreSQL delete with inner join

This is pythonic/dynamic way how to solve your problem. Let me know if you have any questions/suggestions


adding definition of those two tables below

CREATE TABLE table_a (id INT GENERATED ALWAYS AS IDENTITY, something int, should_delete int DEFAULT 0,
PRIMARY KEY(id));

CREATE TABLE table_b (id INT GENERATED ALWAYS AS IDENTITY, something_else int, should_delete int DEFAULT 0, reference_table_a_id INT,
PRIMARY KEY(ID), 
CONSTRAINT fk_table_a_id
 FOREIGN KEY(reference_table_a_id) 
      REFERENCES table_a(id)
 );
jana
  • 166
  • 1
  • 3
  • Thanks for the suggestion - I think there's 2 tricky parts to this problem: dynamically finding the tables that should be deleted from, and then deleting from these in the correct order so constraints aren't violated in the process (if this is possible). This answers the first part but I think I'd still run into issues with the second part if things were deleted in the wrong order, as there's nothing to ensure statements are run in the optimal order. – Bob Whitelock Jul 31 '20 at 17:53