
I have a Postgres database table (users) with the columns and data shown below:

id  user_id status
1   1       Green
2   2       Green
3   1       Blue
4   1       Green

A few unexpected concurrent insert queries are being executed on this table, as shown below (we don't have control over what triggers these queries):

17:38:31.927 -> INSERT INTO users (user_id, status) VALUES (1, 'Green');
17:38:31.927 -> INSERT INTO users (user_id, status) VALUES (1, 'Green');

This results in two new records added to the table:

id  user_id status
1   1       Green
2   2       Green
3   1       Blue
4   1       Green
5   1       Green
6   1       Green

I want to avoid these duplicate records being created. More generally, a record with a given user_id and status should not be inserted if the latest record for that same user_id already has that same status.

I cannot manage this logic at the code level as these are concurrent queries.

I have explored unique constraints and upsert in Postgres, but didn't find a solution for the above scenario.
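
To make the rule concrete, the check I need looks roughly like the conditional insert below (a rough sketch using the users table and values from the example above). On its own it does not solve the problem, because two such statements running concurrently can both see the same "latest" row and both insert:

-- Sketch of the intended rule: insert only if the latest row (highest id)
-- for this user_id has a different status. Not concurrency-safe by itself:
-- two concurrent executions can both pass the check and both insert.
INSERT INTO users (user_id, status)
SELECT 1, 'Green'
WHERE 'Green' IS DISTINCT FROM (
    SELECT status
    FROM users
    WHERE user_id = 1
    ORDER BY id DESC
    LIMIT 1
);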

EDIT -

Why can't I add a unique constraint on (user_id, status)?

Ans. There can be duplicate (user_id, status) pairs. Suppose 'Green' is the latest status of user_id 1 (as shown in the first table); then we cannot insert status 'Green' again for the same user_id. But we can insert 'Blue' and then insert 'Green' again, and this way there will be two records with user_id 1 and status 'Green'. Similarly, multiple identical queries get triggered concurrently, so if one of those queries executes first, the other remaining queries should fail because they have the same user_id and status.
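
For example, this perfectly valid sequence (starting from the first table above, where the latest status of user_id 1 is 'Green') ends up with two rows having user_id 1 and status 'Green', which a unique constraint on (user_id, status) would wrongly reject:

INSERT INTO users (user_id, status) VALUES (1, 'Blue');   -- allowed, latest status is now 'Blue'
INSERT INTO users (user_id, status) VALUES (1, 'Green');  -- allowed again, latest status is now 'Green'
-- the table now legitimately contains two rows with (user_id, status) = (1, 'Green')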

m0rem0rem0re
  • Why is this question tagged with MySQL if you need a PostgreSQL solution? Also, why not use something like a `UNIQUE` index? – Nico Haase Mar 23 '22 at 13:07
  • "...if the latest record..." -- How do you define the "latest"? – The Impaler Mar 23 '22 at 13:10
  • @TheImpaler we can find the latest record using id, ids are auto-incremented. Thanks! – m0rem0rem0re Mar 23 '22 at 13:16
  • @ErgestBasha Thanks! checked now, `ON CONFLICT` is not helping in this scenario – m0rem0rem0re Mar 23 '22 at 13:21
  • Create a unique index on `(user_id, status)` –  Mar 23 '22 at 13:24
  • Create a unique composite index on these 2 columns. The duplicates will never be inserted, even in theory. – Akina Mar 23 '22 at 13:24
  • Can you please explain why you can't use a unique constraint? Are there situations when (user_id, status) should NOT be unique? – Jeremy Mar 23 '22 at 14:13
  • @Jeremy Thanks! There can be duplicate (user_id, status) pairs. Suppose 'Green' is the latest status of user_id 1 (as shown in the first table); then we cannot insert status 'Green' again for the same user_id. But we can insert 'Blue' and then insert 'Green' again, and this way there will be two records with user_id 1 and status 'Green'. Similarly, multiple identical queries get triggered concurrently, so if one of those queries executes first, the other remaining queries should fail because they have the same user_id and status. – m0rem0rem0re Mar 23 '22 at 16:32
  • @a_horse_with_no_name Thanks! Please refer to the above comment for the reason why we cannot add a unique index on (user_id, status). – m0rem0rem0re Mar 23 '22 at 16:36
  • Please add all clarification to your question by editing it – Nico Haase Mar 23 '22 at 16:37

1 Answer


When you add a unique constraint to a table, Postgres automatically creates a unique index for it. While creating that index, the database checks all existing data in the table to verify that it satisfies the constraint. So if you want to make the user_id and status fields unique, the existing values of these fields must not already be duplicated in the table. For that, we must first delete the duplicate rows for these fields. I wrote a simple query for deleting the duplicate data.

Example Query:

with dubl_data as materialized
(
    select
        id,
        user_id,
        status,
        -- highest id within each (user_id, status) group: the row to keep
        max(id) over (partition by user_id, status) as max_id,
        -- number of rows in each (user_id, status) group
        count(id) over (partition by user_id, status) as count_id
    from
        test_table
)
delete from test_table t1
using dubl_data t2
where
    t1.id = t2.id
    and t2.count_id > 1      -- only groups that actually contain duplicates
    and t2.max_id <> t2.id;  -- keep the row with the highest id, delete the rest

Let me explain how this query works. The count_id > 1 condition selects only the (user_id, status) groups that contain duplicates. max(id) calculates the highest id within each of those groups, and the t2.max_id <> t2.id condition deletes every duplicate row except the one with that highest id.
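
After the duplicate rows are removed, the unique constraint on these two fields can be added, for example like this (the constraint name here is just an example):

alter table test_table
    add constraint test_table_user_id_status_key unique (user_id, status);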

Ramin Faracov