
I have a Postgres table with these columns:

  id          int8
, user_id     varchar
, is_favorite boolean
, join_time   timestamptz

I want to delete rows from this table subject to these conditions:

  1. Keep at most 10 rows per user_id.
  2. Those 10 rows must include all of that user_id's rows with is_favorite = true.
    (There can never be more than 5 rows with is_favorite = true per user_id.)
  3. The rest of the 10 rows must be the ones with the latest join_time.

In other words, delete every row beyond the 10 kept per user_id.

Example

id|user_id                             |is_favorite|join_time
--+------------------------------------+-----------+-----------------------------
1 |655caab8-ce81-11ed-afa1-0242ac120002|true       |2023-03-04 15:16:40.000 +0300
2 |655caab8-ce81-11ed-afa1-0242ac120002|true       |2023-03-03 15:16:25.000 +0300
3 |655caab8-ce81-11ed-afa1-0242ac120002|true       |2023-03-02 15:16:40.000 +0300
4 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-04-22 15:16:40.000 +0300
5 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-23 15:16:25.000 +0300
6 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-21 15:16:25.000 +0300
7 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-20 15:16:40.000 +0300
8 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-19 15:16:25.000 +0300
9 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-18 15:16:40.000 +0300
10|655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-17 15:16:25.000 +0300
11|655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-16 15:16:40.000 +0300
12|655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-15 15:16:25.000 +0300
13|655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-14 15:16:40.000 +0300
14|655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-14 15:16:39.000 +0300
15|81c126b6-ce81-11ed-afa1-0242ac120002|true       |2023-03-01 12:16:25.000 +0300
16|81c126b6-ce81-11ed-afa1-0242ac120002|true       |2023-03-01 11:16:25.000 +0300
17|81c126b6-ce81-11ed-afa1-0242ac120002|true       |2023-03-01 10:16:25.000 +0300
18|81c126b6-ce81-11ed-afa1-0242ac120002|true       |2023-03-01 09:16:25.000 +0300
19|81c126b6-ce81-11ed-afa1-0242ac120002|true       |2023-03-01 08:16:25.000 +0300
20|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 07:16:25.000 +0300
21|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 06:16:25.000 +0300
22|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 05:16:25.000 +0300
23|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 04:16:25.000 +0300
24|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 03:16:25.000 +0300
25|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 02:16:25.000 +0300
26|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 01:16:25.000 +0300

For user_id = 655caab8-ce81-11ed-afa1-0242ac120002, IDs 11, 12, 13 and 14 must be deleted.

For user_id = 81c126b6-ce81-11ed-afa1-0242ac120002, IDs 25 and 26 must be deleted.

  • Do you expect to delete (many) more rows than you keep? In that case: are you at liberty to create a new table with surviving rows to replace the old? Always declare your version of Postgres, please. An actual, complete `CREATE TABLE` script, also showing constraints, is much more helpful than any free-hand description. – Erwin Brandstetter Mar 29 '23 at 23:46

2 Answers


Since you are processing the whole table, using a simple subquery with row_number() should be fastest:

DELETE FROM tbl t
USING (
   SELECT id, row_number() OVER (PARTITION BY user_id
                                 ORDER BY is_favorite DESC, join_time DESC
                                 ROWS UNBOUNDED PRECEDING) AS rn
   FROM   tbl t
   ) del
WHERE  t.id = del.id
AND    del.rn > 10;

Adding ROWS UNBOUNDED PRECEDING is optional, but should make the query substantially faster (until Postgres 16 is released).

With the right sort order, this keeps the top 10 most desirable rows per user and deletes the rest.

true sorts before false in descending order.
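A quick way to convince yourself of that sort order (a standalone check, not part of the answer's query):

```sql
-- Postgres orders boolean as false < true, so DESC puts true first
SELECT b
FROM   (VALUES (true), (false)) AS t(b)
ORDER  BY b DESC;
-- returns: true, false
```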

If there can be null values, you need to do more. Like, first of all clarify your question.

Obviously, there would be race conditions with concurrent writes. If there can be concurrent write load, take a write lock on the table in the same transaction first ...
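A sketch of that locking step, wrapped around the DELETE from above (the lock mode is an assumption; pick the weakest mode that fits your workload):

```sql
BEGIN;
-- Block concurrent writers (but not plain readers) while we clean up.
LOCK TABLE tbl IN SHARE ROW EXCLUSIVE MODE;

DELETE FROM tbl t
USING (
   SELECT id, row_number() OVER (PARTITION BY user_id
                                 ORDER BY is_favorite DESC, join_time DESC
                                 ROWS UNBOUNDED PRECEDING) AS rn
   FROM   tbl
   ) del
WHERE  t.id = del.id
AND    del.rn > 10;

COMMIT;
```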

If that's going to delete the majority of rows, it may be cheaper to create a new table of survivors instead ...
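That "table of survivors" route might look like this sketch (you would still need to recreate indexes, constraints, and privileges on the new table before the swap; details depend on your schema):

```sql
-- Copy only the rows to keep, then swap the tables.
CREATE TABLE tbl_new AS
SELECT id, user_id, is_favorite, join_time
FROM  (
   SELECT *, row_number() OVER (PARTITION BY user_id
                                ORDER BY is_favorite DESC, join_time DESC) AS rn
   FROM   tbl
   ) sub
WHERE  rn <= 10;

BEGIN;
DROP TABLE tbl;
ALTER TABLE tbl_new RENAME TO tbl;
COMMIT;
```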

There are other ways, too.

Aside: use type uuid for your user_id column instead of varchar. Much better.
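The conversion is a one-liner, assuming every existing value is a valid UUID (the statement fails otherwise):

```sql
-- Rewrites the column from varchar to uuid in place.
ALTER TABLE tbl
  ALTER COLUMN user_id TYPE uuid USING user_id::uuid;
```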

  • Great, thanks for the quick response @ErwinBrandstetter. This script will execute 8 times in about 5 minutes on this table, by adding `del.user_id in (x, y, z)` to the script. The table contains about 200K rows. Could this cause a performance issue? (version: PostgreSQL 12.10) – ee_engineer Mar 30 '23 at 16:11
  • @ee_engineer: For just 3 given `user_id`, a different query will be far more efficient. You should have said so. Basically, just push the additional `WHERE` clause down into the subquery. Maybe start a new question with your exact case. – Erwin Brandstetter Mar 30 '23 at 22:44
  • Thanks, I have opened a new topic: https://stackoverflow.com/questions/75895419/delete-excess-rows-per-group-with-criteria – ee_engineer Mar 31 '23 at 06:24
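For reference, the pushed-down filter suggested in the comment above might look like this (the user_id values are placeholders):

```sql
DELETE FROM tbl t
USING (
   SELECT id, row_number() OVER (PARTITION BY user_id
                                 ORDER BY is_favorite DESC, join_time DESC) AS rn
   FROM   tbl
   WHERE  user_id IN ('x', 'y', 'z')  -- filter inside the subquery, not outside
   ) del
WHERE  t.id = del.id
AND    del.rn > 10;
```

Filtering inside the subquery lets Postgres rank only the targeted users instead of the whole table.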

You can use row_number() twice: the first removes records ranked above 10 overall, and the second removes records with is_favorite = true ranked above 5.

with cte as (
  select *,
         -- overall rank per user: favorites first, then newest first
         row_number() over (partition by user_id
                            order by is_favorite desc, join_time desc) as rn,
         -- rank among favorites only (non-favorites get 0);
         -- join_time added as tiebreaker to make the ranking deterministic
         case when is_favorite
              then row_number() over (partition by user_id
                                      order by is_favorite desc, join_time desc)
              else 0
         end as fav_rn
  from mytable
)
delete from mytable
where id in (
  select id
  from cte
  where rn > 10 or fav_rn > 5
);


SelVazi