
I have the following (simplified) table structure:

table_user_book
id | book_id | user_id | page 
---+---------+---------+------
 1 | b1      | c1      | 16
 2 | b1      | c1      | 130
 3 | b2      | c1      | 54
 4 | b4      | c2      | 97
 5 | b5      | c2      | 625
 6 | b5      | p3      | 154

As you can see, I have duplicates. I want a single entry per user/book pair; "page" is the last page read by the user. The following query gives me the correctly merged data, but I now want to update the table and delete the redundant records.

-- one row per (book_id, user_id) pair: keep the lowest id and the highest page
select 
    min("id") as "id", 
    max("page") as "page",      
    "book_id",
    "user_id",
    count(*) as "duplicates"
from myschema."table_user_book"
group by "book_id", "user_id"
order by count(*) desc;

I've run an update query using the above query as the source, matching on "id".
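I no longer have the exact update statement, but it was along these lines (a reconstruction, so the subquery alias and details may have differed):

update myschema."table_user_book" t
set "page" = s."page"
from (
    -- highest page and surviving (lowest) id per user/book pair
    select
        min("id") as "id",
        max("page") as "page"
    from myschema."table_user_book"
    group by "book_id", "user_id"
) s
where t."id" = s."id";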

I then tried the classic self-join delete (WHERE a.id < b.id AND ...) to remove the duplicate rows, but my DB goes to timeout. I don't have the exact delete query I used, but it was quite similar to:

delete from
    myschema."table_user_book" a
using myschema."table_user_book" b
where
    a."id" > b."id"
    and a."book_id" = b."book_id"
    and a."user_id" = b."user_id";

I've also tried using sets to delete the rows, but that is an even worse solution:

delete from
    myschema."table_user_book"
where "id" not in ( ... above filter query with only ids selected ... )

I have around 400k rows in that table and 10-15% of them are duplicates!

For now I'm creating a tmp table with the right values, dropping the original one, and recreating it:

create table myschema."tmp" as ( ... query above ... );
drop table myschema."table_user_book";
create table myschema."table_user_book" as ( select * from myschema."tmp" );
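With the placeholders filled in, the workaround is roughly the following (note this loses any constraints, indexes, and defaults on the original table, which have to be recreated separately):

create table myschema."tmp" as (
    select
        min("id") as "id",
        max("page") as "page",
        "book_id",
        "user_id"
    from myschema."table_user_book"
    group by "book_id", "user_id"
);

drop table myschema."table_user_book";

create table myschema."table_user_book" as (
    select * from myschema."tmp"
);

-- clean up the temporary copy
drop table myschema."tmp";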
hitech95
  • Is the problem that your timeout is too stringent, or that your delete is too slow? – jjanes Mar 29 '22 at 17:24
  • All the queries so far were quite fast. This is the second time I'm facing this timeout issue. Last time I was advised to use a [CTE](https://stackoverflow.com/questions/64972224/merge-duplicate-table-rows-that-are-used-in-a-relation-with-another-table), but in this case it still times out. – hitech95 Mar 29 '22 at 23:45
  • Did you try an `EXISTS` condition instead of `USING`? –  Mar 30 '22 at 05:48

1 Answer


For deleting only the duplicated records you have two options:

  1. Create a recursive function that detects and deletes the duplicates.
  2. Use a CTE (`with ... as`), optionally materialized, to find the duplicates and delete them in a single statement, without recursion (see the sketch after this list).
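For the second option, a minimal sketch against the table from the question (the `materialized` keyword requires PostgreSQL 12 or later; on older versions simply drop it):

-- collect the id to keep for each (book_id, user_id) pair,
-- then delete every row whose id is not in that set
with keep as materialized (
    select min("id") as "id"
    from myschema."table_user_book"
    group by "book_id", "user_id"
)
delete from myschema."table_user_book" t
where not exists (
    select 1
    from keep k
    where k."id" = t."id"
);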

For a sample query, please visit this URL.

Ramin Faracov