SQL UNION ALL to eliminate duplicates

Question

I found this sample interview question and answer posted on toptal reproduced here. But I don't really understand the code. How can a UNION ALL turn into a UNIION (distinct) like that? Also, why is this code faster?

QUESTION

Write a SQL query using UNION ALL (not UNION) that uses the WHERE clause to eliminate duplicates. Why might you want to do this? Hide answer You can avoid duplicates using UNION ALL and still run much faster than UNION DISTINCT (which is actually same as UNION) by running a query like this:

ANSWER

SELECT * FROM mytable WHERE a=X UNION ALL SELECT * FROM mytable WHERE b=Y AND a!=X

The key is the AND a!=X part. This gives you the benefits of the UNION (a.k.a., UNION DISTINCT) command, while avoiding much of its performance hit.

This explanation is misleading... It's essentially doing a `UNION ALL` but filtering out the results from the first query. It's just filtering using the `WHERE` clause, rather than using an expensive `DISTINCT` operation. — Siyual, Jan 18 '17 at 20:46
Right, the only uniqueness you can guarantee from this particular query is the uniqueness of table `mytable`. In every way this query is the as `SELECT * FROM mytable`. The point, though, is that by filtering in your individual SELECT statements, you may be able to get quicker results, then resorting to a `UNION`. — JNevill, Jan 18 '17 at 20:49
The answer is wrong. This query does not eliminate duplicates that alreaydy exist in the table, while UNION DISTICT would do. — Paul Spiegel, Jan 18 '17 at 20:51
@PaulSpiegel, assuming there is a primary key that does not involve `a`, there will be no duplicates because of the `select *`. — Dan Bracuk, Jan 18 '17 at 20:56
@DanBracuk - I'm aware of that. But that is not said in the question. One could also say `select *` is equivalent to `select distinct *`. But that is not always true. — Paul Spiegel, Jan 18 '17 at 21:01
Can you quote the source of the question and answer? Are you sure this is ALL of the content of the question and answer? — DVT, Jan 18 '17 at 21:19
https://www.toptal.com/sql/interview-questions Quoted exactly. — user3685285, Jan 18 '17 at 21:39

Bill Karwin · Accepted Answer · 2017-01-18T22:55:48.297

But in the example, the first query has a condition on column a, whereas the second query has a condition on column b. This probably came from a query that's hard to optimize:

SELECT * FROM mytable WHERE a=X OR b=Y

This query is hard to optimize with simple B-tree indexing. Does the engine search an index on column a? Or on column b? Either way, searching the other term requires a table-scan.

Hence the trick of using UNION to separate into two queries for one term each. Each subquery can use the best index for each search term. Then combine the results using UNION.

But the two subsets may overlap, because some rows where b=Y may also have a=X in which case such rows occur in both subsets. Therefore you have to do duplicate elimination, or else see some rows twice in the final result.

SELECT * FROM mytable WHERE a=X 
UNION DISTINCT
SELECT * FROM mytable WHERE b=Y

UNION DISTINCT is expensive because typical implementations sort the rows to find duplicates. Just like if you use SELECT DISTINCT ....

We also have a perception that it's even more "wasted" work if the two subset of rows you are unioning have a lot of rows occurring in both subsets. It's a lot of rows to eliminate.

But there's no need to eliminate duplicates if you can guarantee that the two sets of rows are already distinct. That is, if you guarantee there is no overlap. If you can rely on that, then it would always be a no-op to eliminate duplicates, and therefore the query can skip that step, and therefore skip the costly sorting.

If you change the queries so that they are guaranteed to select non-overlapping subsets of rows, that's a win.

SELECT * FROM mytable WHERE a=X 
UNION ALL 
SELECT * FROM mytable WHERE b=Y AND a!=X

These two sets are guaranteed to have no overlap. If the first set has rows where a=X and the second set has rows where a!=X then there can be no row that is in both sets.

The second query therefore only catches some of the rows where b=Y, but any row where a=X AND b=Y is already included in the first set.

So the query achieves an optimized search for two OR terms, without producing duplicates, and requiring no UNION DISTINCT operation.

Wow, this answer makes a lot of sense to me. But why are some people saying it's wrong? Does it fail in some cases? — user3685285, Jan 19 '17 at 13:42
The objection is that it doesn't account for cases where the table itself has duplicate rows (which should never be the case in a normalized database, but hey it happens). The query with `UNION DISTINCT` would eliminate duplicates from the result set. The `UNION ALL` keeps such duplicates. — Bill Karwin, Jan 19 '17 at 16:13
What this explanation fails to mention is what makes the two conditions joint by an AND (`WHERE b=Y AND a!=X`) any more efficient than the original conditions joined by an OR (`WHERE a=X OR b=Y`). If we don't now the AND is somehow inherently better than the OR, then the whole exercise is for naught and all we've done is make the code more obscure. — Agent Friday, Aug 11 '22 at 04:17
@AgentFriday, Wow, have you read my answer? The optimization strategy and why it works is literally all it talks about. — Bill Karwin, Aug 11 '22 at 04:24
@Bill Karwin, Yes, and I just read it carefully again. No mention of why the AND condition in the final solution is in any way advantageous over the OR condition of the original. — Agent Friday, Aug 11 '22 at 21:47

score 1 · Answer 2 · answered Mar 01 '21 at 16:24

1

The most simple way is like this, especially if you have many columns:

SELECT *
  INTO table2
  FROM table1
  UNION
SELECT *
  FROM table1
  ORDER BY column1

answered Mar 01 '21 at 16:24

Martin Schumann

11
1

Hi, thanks for your answer. This creates `table2`, and is not what the author of the question was asking. Try answering "*How can a UNION ALL turn into a UNION (distinct) like that? Also, why is this code faster?*" from the post. – Connor Low Mar 01 '21 at 16:56

Sachin Jawale · Answer 3 · 2021-05-29T15:42:43.417

1

I guest this is right (Oracle):

select distinct * from (

select * from test_a

union all

select * from test_b
);

edited May 29 '21 at 15:42

answered May 29 '21 at 15:09

Sachin Jawale

21
2

Vasyl Zvarydchuk · Answer 4 · 2017-01-18T22:36:49.127

The question will be correct if the table has unique identifier - primary key. Otherwise every select can return many the same rows.

To understand why it can faster let's look at how database executes UNION ALL and UNION.

The first is simple joining results from two independent queries. These queries can be processed in parallel and taken to client one by one.

The second is joining + distinction. To distinct records from 2 queries db needs to have all them in memory or if memory is not enough db needs to store them to temporary table and next select unique ones. This is where performance degradation can be. DB's are pretty smart and distinction algorithms are developed good but for large result sets it could be a problem anyway.

UNION ALL + additional WHERE condition can be faster if an index will be used while filtering. So, here the performance magic.

score 0 · Answer 5 · edited Aug 09 '17 at 13:08

0

I guess it will work

select col1 From (
select row_number() over (partition by col1 order by col1) as b, col1 
from (
select col1  From u1
union all
select col1 From u2 ) a
) x
where x.b =1

edited Aug 09 '17 at 13:08

answered Aug 09 '17 at 13:01

Amit Dhiman

1

score 0 · Answer 6 · edited Apr 20 '20 at 23:52

0

This will also do the same trick:

select * from (
select * from table1
union all 
select * from table2
) a group by 
columns
having count(*) >= 1

or

select * from table1 
union all
select * from table2 b 
where not exists (select 1 from table1 a where a.col1 = b.col1)

edited Apr 20 '20 at 23:52

d219

2,707
5
31
36

answered Apr 20 '20 at 23:25

Layman

1

SQL UNION ALL to eliminate duplicates

6 Answers6

Linked