joins versus correlated exists subqueries

Question

In Derby and/or MySQL, I am wondering about the performance of:

select c0.documentid from contentblob as c0 where c0.documentid > ? 
  and c0.blobid = ? 
  and exists (select * from contentblob c1 where c1.documentid = c0.documentid 
              and c1.blobid = ?)
  order by c0.documentid

Note that DIFFERENT values will be supplied for the two blobid comparisons. What this query does is return the id's that occur in the table (at least) twice: once with blobid1 and once with blobid2.

compared to the equivalent self-join,

assuming, of course, indices on the documentid and blobid columns of the contentblob table.

In case it isn't clear, the relevant schema here is a table with two varchar columns, documentid and blobid. They are composed to form the primary key.

If there's nothing for it but to inquire of the query analyzer in the db, I'll go do that.

score 2 · Accepted Answer · answered Dec 16 '10 at 08:07

Generally, correlated sub-queries are more expensive than joins, even outer self-joins, though in this case all you need is a regular inner self-join:

SELECT c0.documentid
  FROM contentblob AS c0
  JOIN contentblob AS c1 ON c1.documentid = c0.documentid
 WHERE c0.documentid > ? 
   AND c0.blobid = ?
   AND c1.blobid = ?
   AND c0.blobid < c1.blobid
 ORDER BY c0.documentid

The last condition avoids seeing two entries for a pair of rows with the same document ID and the relevant blob ID values; without it, you'd see the document ID twice, once for c0.blobid = v1 and once for c0.blobid = v2.

You might want to move the c1.blobid condition into the ON clause, but the optimizer should do that for you anyway. Also check that your optimizer is using an appropriate index. It isn't entirely clear what that index would be (but I think an index with blobid as the leading column would probably perform best), but if the tables are large and the optimizer is using a sequential scan of the table, you have problems.

James Anderson · Answer 2 · 2010-12-16T03:40:30.057

0

I think the left join would be clearer

  select c0.documentid from contentblob as c0 where c0.documentid > ? 
  and c0.blobid = ? 
  left join contentblob c1 on c0.documentid = c1.documentid 
       and c1.blobid = ? 
  order by c0.documentid

Mileage will vary for different database engines, and, will also vary with table sizes, index cardinality etc. but my gut feeling is the join would perform better.

edited Dec 16 '10 at 03:40

answered Dec 16 '10 at 03:24

James Anderson

27,109
7
50
78

See edit for what I hope will clarify. Two different blobids are involved. – bmargulies Dec 16 '10 at 03:27
Ah! Should have spotted that -- changed answer accordingly! – James Anderson Dec 16 '10 at 03:40
I've always felt that the EXISTS() statements were clearer to read and they don't create potential duplicates in the result set. – Al W Dec 16 '10 at 03:43

joins versus correlated exists subqueries

2 Answers2

Linked