Self join vs group by when counting duplicates

Question

I'm trying to count duplicates based on a column of a table in an Oracle Database. This query using group by:

select count(dockey), sum(total)
from
(
select doc1.xdockeyphx dockey, count(doc1.xdockeyphx) total
from ecm_ocs.docmeta doc1
where doc1.xdockeyphx is not null
group by doc1.xdockeyphx
having count(doc1.xdockeyphx) > 1
)

Returns count = 94408 and sum(total) = 219330. I think this is the correct value.

Now, trying this other query using a self join:

select count(distinct(doc1.xdockeyphx))
from ecm_ocs.docmeta doc1, ecm_ocs.docmeta doc2
where doc1.did > doc2.did
and doc1.xdockeyphx = doc2.xdockeyphx
and doc1.xdockeyphx is not null
and doc2.xdockeyphx is not null

The result is also 94408 but this one:

select count(*)
from ecm_ocs.docmeta doc1, ecm_ocs.docmeta doc2
where doc1.did > doc2.did
and doc1.xdockeyphx = doc2.xdockeyphx
and doc1.xdockeyphx is not null
and doc2.xdockeyphx is not null

Is returning 1567466, which I think is wrong.

The column I'm using to find duplicates is XDOCKEYPHX and the DID is the primary key of the table.

Why is the value sum(total) different from the result of the last query? I can't see why the last query is returning more duplicate rows than expected.

+1 not for making more confusions... but for explaining what you are doing — Srini V, Mar 13 '14 at 16:32

vogomatix · Answer 1 · 2014-03-13T15:06:23.207

You don't need the complexity of your last where clause:

where doc1.did > doc2.did
and doc1.xdockeyphx = doc2.xdockeyphx
and doc1.xdockeyphx is not null
and doc2.xdockeyphx is not null

If you think about it, doc2.xdockeyphx cannot be null if doc1.xdockeyphx is not null, because the two columns are compared with = and a comparison with NULL never evaluates to true. Perhaps it is better expressed with an explicit join:

select count(*)
from ecm_ocs.docmeta doc1
join ecm_ocs.docmeta doc2
on doc1.xdockeyphx = doc2.xdockeyphx
where doc1.xdockeyphx is not null and doc1.did > doc2.did
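The point about the redundant null checks can be tried on a small scale. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for Oracle, and the four-row docmeta table is a made-up miniature, not the real ecm_ocs.docmeta. Under those assumptions, the join returns the same count with and without the is not null filter, because rows with a NULL key never satisfy the equality.

```python
import sqlite3

# Hypothetical miniature of docmeta; SQLite stands in for Oracle here.
conn = sqlite3.connect(":memory:")
conn.execute("create table docmeta (did integer primary key, xdockeyphx integer)")
conn.executemany(
    "insert into docmeta values (?, ?)",
    [(1, 1), (2, 1), (3, None), (4, None)],  # two rows share key 1, two have no key
)

join_sql = """
    select count(*)
    from docmeta doc1
    join docmeta doc2 on doc1.xdockeyphx = doc2.xdockeyphx
    where doc1.did > doc2.did
"""

# Same join, once with the explicit null filter and once without it.
with_filter = conn.execute(join_sql + " and doc1.xdockeyphx is not null").fetchone()[0]
without_filter = conn.execute(join_sql).fetchone()[0]

print(with_filter, without_filter)
```

Both counts come out equal, since the two NULL-keyed rows are never paired with each other by the equality condition.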

Your first two queries report distinct/grouped results, whereas your last one simply counts every row the join produces, which is why the counts differ.

score 0 · Answer 2 · answered Mar 13 '14 at 14:25

In the third query, column names are duplicated due to the use of (*); you could maybe replace select count(*) with select count(doc1.*).

Goon10

score 0 · Answer 3 · answered Mar 13 '14 at 16:46

Let's keep it simple.

SELECT FROM_ID,
       TO_ID
FROM   TABLE1;

This fetches the five rows of the table.

Note: TO_ID is the PK of this table.

Your first query (with the predicates changed, of course):

SELECT COUNT ( DOCKEY ), SUM ( TOTAL )
FROM   (SELECT   DOC1.TO_ID DOCKEY, COUNT ( DOC1.TO_ID ) TOTAL
        FROM     TABLE1 DOC1
        GROUP BY DOC1.TO_ID
        HAVING   COUNT ( DOC1.TO_ID ) > 0);

Produces

5    5

Here I selected rows grouped by TO_ID, which produces five rows in the subquery; the aggregation in the main query then counts them as 5.

Now in the second query, even if you replace the select with COUNT(*) as in the third you should get the same count. The reason is I am joining them on the PK.

SELECT COUNT ( DISTINCT ( DOC1.TO_ID ) )
FROM   TABLE1 DOC1, TABLE1 DOC2
WHERE  DOC1.TO_ID = DOC2.TO_ID;

5


SELECT COUNT(*)
FROM   TABLE1 DOC1, TABLE1 DOC2
WHERE  DOC1.TO_ID = DOC2.TO_ID;

5

But in your case, you are not joining on the PK; you only use it in a predicate.

In a self join, TABLE1.COL1 = TABLE1.COL1 acts as the join condition (JOIN ON). When COL1 is not unique, every row matches every other row with the same value, so within each group of equal values the join behaves like a Cartesian product, and a predicate such as TABLE1.COL1 > TABLE1.COL1 only filters that product.

So in your second query, DISTINCT saved you from these duplicates, while the third query is a mere count of the returned rows. To check this, you can run it with select * instead.
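The difference between joining on the PK and joining on a non-unique column can be sketched as follows. This is an illustration only: it runs on SQLite through Python's sqlite3 module rather than on Oracle, and the five rows in table1 are invented for the example (the answer's actual data is not shown).

```python
import sqlite3

# Hypothetical data: to_id is unique (the PK), from_id repeats.
conn = sqlite3.connect(":memory:")
conn.execute("create table table1 (from_id integer, to_id integer primary key)")
conn.executemany(
    "insert into table1 values (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 4), (2, 5)],
)

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# Self join on the PK: each row matches only itself.
pk_join = scalar(
    "select count(*) from table1 doc1, table1 doc2 where doc1.to_id = doc2.to_id"
)

# Self join on a non-unique column: rows multiply within each group.
non_unique_join = scalar(
    "select count(*) from table1 doc1, table1 doc2 where doc1.from_id = doc2.from_id"
)

# DISTINCT on the join column hides that multiplication.
distinct_keys = scalar(
    "select count(distinct doc1.from_id) from table1 doc1, table1 doc2 "
    "where doc1.from_id = doc2.from_id"
)

print(pk_join, non_unique_join, distinct_keys)
```

With this data the PK join keeps the row count at 5, the non-unique join grows to 3*3 + 2*2 = 13 rows, and the DISTINCT count collapses back to the 2 distinct values.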

score 0 · Accepted Answer · answered Mar 14 '14 at 20:23

Thanks to @vogomatix, since his answer helped me understand my problem and where I was wrong. The last query actually returns one row for each pair of duplicates, with no pair repeated, so its row count cannot be compared with the sum(total) from the first query. Given this case:

DID | XDOCKEYPHX
---------------
1   |    1
2   |    1
3   |    1
4   |    2
5   |    2
6   |    3
7   |    3
8   |    3
9   |    3

The first inner query would return

DOCKEY | TOTAL
--------------
1      |   3
2      |   2
3      |   4

And the full query would give count = 3, meaning there are 3 document keys that occur more than once, and sum(total) = 9, the total number of documents that share their key with at least one other document.
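The group by result on this sample can be reproduced with a short script. It is a sketch under two assumptions: SQLite (via Python's sqlite3 module) replaces Oracle, and the table is called docmeta without the ecm_ocs schema prefix.

```python
import sqlite3

# The nine sample rows from the table above: (DID, XDOCKEYPHX).
sample = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 3), (8, 3), (9, 3)]

conn = sqlite3.connect(":memory:")
conn.execute("create table docmeta (did integer primary key, xdockeyphx integer)")
conn.executemany("insert into docmeta values (?, ?)", sample)

# The first query from the question, unchanged apart from the table name.
key_count, total = conn.execute("""
    select count(dockey), sum(total)
    from (
        select doc1.xdockeyphx dockey, count(doc1.xdockeyphx) total
        from docmeta doc1
        where doc1.xdockeyphx is not null
        group by doc1.xdockeyphx
        having count(doc1.xdockeyphx) > 1
    )
""").fetchone()

print(key_count, total)
```

On the sample rows this prints 3 and 9, matching the count and sum(total) described above.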

Now, the join used by the second and third queries, with just a select *, will give something like:

DID_1 | XDOCKEYPHX | DID_2
--------------------------
2     |     1      |    1
3     |     1      |    1
3     |     1      |    2
5     |     2      |    4
7     |     3      |    6
8     |     3      |    6
8     |     3      |    7
9     |     3      |    6
9     |     3      |    7
9     |     3      |    8

So now, the second query, select count(distinct(xdockeyphx)), will give the correct value 3, but the third query, select count(*), will give 10, which is not what I wanted: I was after the number of duplicated documents summed over all keys (9). What the third query gives you is every pair of duplicates, so you can then compare them or whatever. A key shared by n documents produces n(n-1)/2 pairs, so here 3 + 1 + 6 = 10. My misunderstanding was thinking that counting all the rows of the third query would give the sum(total) of the first query, which was a wrong idea, as I now realize.
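The relationship between the two numbers can be checked with a small script. This is a sketch, not the original setup: it assumes SQLite (via Python's sqlite3 module) in place of Oracle and a docmeta table holding only the nine sample rows. It counts the rows of the self join and compares them with the pair count derived from the group sizes.

```python
import sqlite3

# The nine sample rows from the example: (DID, XDOCKEYPHX).
sample = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 3), (8, 3), (9, 3)]

conn = sqlite3.connect(":memory:")
conn.execute("create table docmeta (did integer primary key, xdockeyphx integer)")
conn.executemany("insert into docmeta values (?, ?)", sample)

# Size of every group of documents that share a key.
group_sizes = [
    n for (n,) in conn.execute(
        "select count(*) from docmeta group by xdockeyphx having count(*) > 1"
    )
]

# Row count of the self join from the third query.
pair_rows = conn.execute("""
    select count(*)
    from docmeta doc1, docmeta doc2
    where doc1.did > doc2.did
    and doc1.xdockeyphx = doc2.xdockeyphx
""").fetchone()[0]

# A group of n rows yields n * (n - 1) / 2 unordered pairs.
expected_pairs = sum(n * (n - 1) // 2 for n in group_sizes)

print(sum(group_sizes), pair_rows, expected_pairs)
```

The sum of the group sizes is 9, while the join returns 10 rows, which equals the pair count computed from the group sizes. The gap between the two grows quickly with group size, which is consistent with the much larger count(*) seen on the real table.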
