-1

I have three tables:


x (column1,..., x_id)


y (column1, column2, colum3, ..., x_id, y_id)


z (column1, column2, colum3,..., x_id, y_id, z_id)

My final goal: Delete duplicates, check the y table for duplicates, and if found check a second table for those duplicates, to see if the resultset is also the same, and if it is then delete one of them. I can join the tables but both of them are really big, so that is my main concern.

For example, This is just an example.

with cte as (
    select column1, column2, y_id, count(*) as cou from y
    group by column1, column2, x_id,
    having count(*) = 2 
)
select distinct tabel1.* from y tabel1
join cte tabel2 on tabel1.x_id= tabel2.x_id and tabel1.column1 = tabel2.column1 and tabel1.column2 = tabel2.column2

Then I want to check these found duplicates and if their items in another table are exactly the same. if it is then delete one of them.

For more clarification consider this example: y table: my table structure is very similar to this and this table is close to the reality of what I have.

column 1 column 2 x_id y_id
foo bar 1 1
foo bar 1 2
xx yy 2 3
zz kk 2 4
tt mm 2 5
baz qux 3 6
baz qux 3 7

has two groups with the same column 1 and column 2 and x_id values (the y_ids = (1 and 2) and y_id == (6,7).

Then for these rows, I want to check a second table to see if all the items are found in each other. meaning they have the same items in the z table. The items of the y_id = 1 and y_id = 2 are the same, so one of them in y table should be deleted.

{color:red;} column 1 column 2 column x_id y_id z_id
foo bar qux 1 1 1
foo2 bar2 qux2 1 1 2
foo bar qux 1 2 3
foo2 bar2 qux2 1 2 4
baz foo qux 3 6 5
baz2 foo2 qux1 3 6 6
baz3 foo3 qux2 3 6 7
baz foo qux 3 7 8
baz2 foo2 qux1 3 7 9

I want to delete the row with y_id = 1 from the y table ( or y_id = 2, which duplicate row doesn't make a difference). The tables are very big so a fast solution would be great. I know I can join the tables, but I was hoping I can find another solution. A solution that can use the group by ids instead and just look for the found items there inside the z table (third table)

DRA
  • 165
  • 8
  • What are you stuck on? – Dale K Aug 21 '23 at 21:15
  • I want to delete one instance of duplicates, but not only based on the first table columns, after duplicates are found there, check for their items in table z and then see if the query result based on the ids are actually the same (the second comparison also is done based on certain columns). I don't know how to put those ids I have in the first group in the query that checks for the duplicates in the second table. – DRA Aug 21 '23 at 21:22
  • 1
    Put them in a temp table? – Dale K Aug 21 '23 at 21:22
  • @DaleK Okay. But I need to know what I should compare with each other. I want to check the ones that are in the same group as the group found in the y table. so for example check the y_id = 1 and y_id = 2 and then then check y_id = 7 and y_id = 6 and so on. How can I define they belong to the same group? should I add another column for that? and then how to use them for comparison query in the z table? I'm very confused about this. – DRA Aug 21 '23 at 21:35
  • TBH I'm confused also... your question is more complicated than I have time to investigate. – Dale K Aug 21 '23 at 21:36
  • I think you should have used more concrete (less abstract) example to improve our understand of the problem. Anyways, what is stopping you from joining all 3 tables together (on all the columns you need) and then doing a `GROUP BY` or [`ROW_NUMBER() OVER( [partition] )`](https://learn.microsoft.com/en-us/sql/t-sql/functions/row-number-transact-sql?view=sql-server-ver16) to identify duplicates? If table size is a problem, use temporary tables, e.g. copy a subset of data to temp tables and use them for analisys. (There is no efficient solution for finding dups across tables) – Alex Aug 22 '23 at 01:57
  • If you have a lot of columns to compare you may think about using `HASHBYTES` function to generate a single hash for all of your columns in a row and then compare hashes. – Alex Aug 22 '23 at 06:20
  • @Alex, because that's exactly what I want to do! just different values. And yes the tables are really big. millions of records in one. and if you can give me an example of how to do it that would be great. – DRA Aug 22 '23 at 14:40
  • I am struggling to understand exactly what you want and I have read your question a few times over. – Alex Aug 23 '23 at 02:09
  • 1
    @Alex I really don't think I am asking for something that hard. If we have three tables x y z, group the y records based on the foreign key of x inside y table, and in each group check if they have duplicates, if they do, then we want to check z table, in which we have the y foreign keys, so based on y keys of those duplicates in z, check if they have exactly the same items in z. like if we have a list for each, regardless of order, the items are the same. that means they are complete duplicates. but if they have different items in z they just have similar names in y table. – DRA Aug 23 '23 at 12:00
  • 1
    @Alex There are usually paired duplicates in each group in y table if there are any, so we can easily compare them. Anyway, I found a solution myself. – DRA Aug 23 '23 at 12:00

1 Answers1

0

I believe the below example will help me to delete the rows I want:

WITH cte AS (
    SELECT
        a.column1 AS a_column1,
        a.column2 AS a_column2,
        a.y_id AS a_y_id,
        a.x_id AS a_x_id,
        b.column1 AS a_column1,
        b.colum2 AS a_column2,
        b.y_id AS b_y_id ,
        b.x_id AS b_x_id 
    FROM YTable a
    JOIN YTable b
    ON a.column1 = b.column1 
        AND a.column2 = b.column2 
        AND a.x_id = b.x_id 
        AND a.y_id != b.y_id
    WHERE a.y_id < b.y_id
)
SELECT
    cte.*,
    CASE
        WHEN EXISTS (
            SELECT 1
            FROM (
                SELECT * FROM ZTable
                WHERE y_id = cte.y_id
            ) table1
            FULL JOIN (
                SELECT * FROM ZTable
                WHERE y_id = cte.y_id
            ) table2
            ON table1.colum1 = table2.column1
                AND table1.colum2 = table2.colum2
                AND table1.colum3 = table2.colum3
            WHERE table1.column1 IS NULL OR table2.column1 IS NULL
        )
        THEN 'Not Identical'
        ELSE 'Identical'
    END AS duplicated
FROM cte;
DRA
  • 165
  • 8