How to find which column causes row duplication in JOIN statement?

Question

MS SQL Server. I have this kind of situation:

SELECT
    x1,
    x2,
    x3,
    x4,
    x5,
    x6,
    x7,
    x8,
    x9,
    x10,
    x11,
    x12,
    x13
FROM
    T1  -- this has a lot of columns
JOIN
    T2, -- this has a lot of columns
    T3, -- this has a lot columns as well
    T4, -- same here and other tables
    T5,
    T6,
    T7,
    T8,
    T9,
    T10,
    T11,
    T12,
    T13,
    T14

And I have some duplicated rows, so the column which contains different values is not among those 13 selected columns. I want to resolve the issue by finding which column in joined tables has varying value and then decide what to do with it.

The solution is to select T1.*, T2.*, etc. and carefully analyse results. The thing is, I'd have to select a lot of columns and spend time to do a repetitive task.

Is there a tool/procedure/query which would do it automatically for me? I believe it's a common task to do when developing complex database queries.

EDIT

I've found an excellent tool which can simplify analysis of a select * query results.

Dbeaver, a free universal database manager, has a "Calc" panel in results view. One can select whatever cells/rows/columns and it calculates count and distinct count of values on the fly. Those can be grouped by columns which is exactly what is needed when searching which column has distinct values. Super useful.

Reference: https://github.com/dbeaver/dbeaver/wiki/Panels

No, this is not a "common task". When developing queries, it is better to start simple and gradually add complexity so you can see what is happening with the data. Plus, your sample query is using cross joins, which necessarily produce Cartesian products. — Gordon Linoff, Apr 25 '18 at 10:41
You seem to need to read some introductions to querying in SQL & even to relational databases. It is extremely unlikey that you want what you are asking for: every possible combination of a row from every table. — philipxy, Apr 26 '18 at 04:23
I'm not developing queries. I have no exact knowledge of a database (nobody has) and need to fix an existing query which failed for a single row among millions. It might be a glich in data and the query is ok. There are a lot of relations not specified by foreign keys as well as migration artifacts (eg. an "unique" identifier which is the same for a few hundreds of thousands rows). I ask for a way to analyse data, not how to write SQL or how relational databases work in a class room. — Lutosław, Jul 12 '18 at 10:16

Ashutosh A · Answer 1 · 2018-04-25T12:07:59.830

1

One option is this -

SELECT
    x1, count (DISTINCT T1.*),
    x2, count (DISTINCT T2.*),
    x3, count (DISTINCT T3.*)
FROM
    T1  -- this has a lot of columns
JOIN
    T2, -- this has a lot of columns
    T3
GROUP BY
    x1,
    x2,
    x3

This will tell you which tables have repeating rows (the ones with count > 1)... you can then analyze the join condition for those tables

edited Apr 25 '18 at 12:07

answered Apr 25 '18 at 10:34

Ashutosh A

975
7
10

this gives "incorrect syntax near *". Did you mean `count(distinct T1.x1)` (assuming that x1 is from T1) – Lutosław Apr 25 '18 at 12:13
No I did mean count (DISTINCT T1.*), as that will tell you the number of distinct rows of that table considering all of its columns - this does work on RedShift ... I did not try other DBs – Ashutosh A Apr 25 '18 at 12:30
In this case, you'll need to list names of all columns after DISTINCT. Hopefully SQL Server Management Studio can get the list for you without you having to type it – Ashutosh A Apr 25 '18 at 12:41

How to find which column causes row duplication in JOIN statement?

1 Answers1