I have the following (simplified) table structure:
table_user_book
 id | book_id | user_id | page
----+---------+---------+------
  1 | b1      | c1      |   16
  2 | b1      | c1      |  130
  3 | b2      | c1      |   54
  4 | b4      | c2      |   97
  5 | b5      | c2      |  625
  6 | b5      | p3      |  154
As you can see, I have duplicates. I want a single entry per user/book pair, where page is the last page read by the user. The following query gives me the correctly merged data, but I now want to update the table and delete the wrong records.
select
    min("id") as "id",
    max("page") as "page",
    "book_id",
    "user_id",
    count(*) as "duplicates"
from myschema."table_user_book"
group by "book_id", "user_id"
order by count(*) desc
I've run an update query using the above query as source, matching on "id".
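For context, the update was along these lines (a reconstructed sketch, not the exact statement I ran; it assumes the row with the minimum id is the one being kept):

```sql
-- Sketch: write the max page onto the min-id row of each user/book pair
update myschema."table_user_book" t
set "page" = s."page"
from (
    select min("id") as "id", max("page") as "page"
    from myschema."table_user_book"
    group by "book_id", "user_id"
) s
where t."id" = s."id";
```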
I've then tried the classic delete query with WHERE a.id > b.id AND ...
to delete the wrong rows, but my DB goes to timeout.
I don't have the exact delete query I used, but it was quite similar to:
delete from myschema."table_user_book" a
using myschema."table_user_book" b
where a.id > b.id
  and a."book_id" = b."book_id"
  and a."user_id" = b."user_id";
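The self-join above is roughly O(n²) without an index on ("book_id", "user_id"). A variant I'm considering, which scans the table once using a window function instead of self-joining (sketch; PostgreSQL syntax):

```sql
-- Sketch: number the rows of each user/book pair by id,
-- then delete everything after the first (lowest-id) row
delete from myschema."table_user_book"
where "id" in (
    select "id"
    from (
        select "id",
               row_number() over (partition by "book_id", "user_id"
                                  order by "id") as rn
        from myschema."table_user_book"
    ) s
    where rn > 1
);
```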
I've also tried using sets to delete the rows, but that is an even worse solution.
delete from myschema."table_user_book"
where "id" not in ( ... above filter query with only ids selected ... );
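NOT IN over a 400k-row subquery forces the planner to re-evaluate the set per row. One workaround I could try is materializing the keep-list first and deleting with NOT EXISTS (sketch; `keep_ids` is a hypothetical temp table name):

```sql
-- Sketch: materialize the ids to keep, then anti-join against them
create temp table keep_ids as
select min("id") as "id"
from myschema."table_user_book"
group by "book_id", "user_id";

delete from myschema."table_user_book" t
where not exists (
    select 1 from keep_ids k where k."id" = t."id"
);
```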
I have around 400k rows in that table and 10-15% of them are duplicates!
For now I'm creating a tmp table with the right values, dropping the original one, and recreating it:
create table myschema.tmp as ( ... query above ... )
drop table myschema.table_user_book
create table myschema.table_user_book as ( select * from myschema."tmp" )
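Spelled out, the workaround I'm using looks roughly like this (a sketch: it wraps the steps in a transaction and renames the temp table instead of copying it a second time; note that CREATE TABLE AS does not carry over constraints, defaults, or indexes, which have to be recreated):

```sql
begin;

-- Sketch: build the deduplicated table, then swap it in
create table myschema.tmp as
select
    min("id") as "id",
    max("page") as "page",
    "book_id",
    "user_id"
from myschema."table_user_book"
group by "book_id", "user_id";

drop table myschema.table_user_book;

alter table myschema.tmp rename to table_user_book;

commit;
```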