Using "UNION ALL" and "GROUP BY" to implement "Intersect"

Question

I'v provided following query to find common records in 2 data sets but it's difficult for me to make sure about correctness of my query because of that I have a lot of data records in my DB.

Is it OK to implement Intersect between "Customers" & "Employees" tables using UNION ALL and apply GROUP BY on the result like below?

SELECT D.Country, D.Region, D.City
  FROM (SELECT DISTINCT Country, Region, City 
          FROM Customers
         UNION ALL
        SELECT DISTINCT Country, Region, City
          FROM Employees) AS D
GROUP BY D.Country, D.Region, D.City
HAVING COUNT(*) = 2;

So can we say that any record which exists in the result of this query also exists in the Intersect set between "Customers & Employees" tables AND any record that exists in Intersect set between "Customers & Employees" tables will be in the result of this query too?

the result will give you what you are looking for. Common Country, Region, City in both tables. — KeithL, Oct 16 '17 at 13:45
Well normally you would simulate INTERSECT with a DISTINCT INNER JOIN or with a IN statement. — Raymond Nijland, Oct 16 '17 at 13:45
Maybe this helps http://www.mysqltutorial.org/mysql-intersect/ — Raymond Nijland, Oct 16 '17 at 13:50
@RaymondNijland -- He is doing a distinct on both tables. So there can only be one record for each Country, Region, City in either table. — KeithL, Oct 16 '17 at 13:51

score 2 · Accepted Answer · answered Oct 16 '17 at 17:19

So is it right to say any record in result of this query is in "Intersect" set between "Customers & Employees" "AND" any record that exist in "Intersect" set between "Customers & Employees" is in result of this query too?

YES.

... Yes, but it won't be as efficient because you are filtering out duplicates three times instead of once. In your query you're

Using DISTINCT to pull unique records from employees
Using DISTINCT to pull unique records from customers
Combining both queries using UNION ALL
Using GROUP BY in your outer query to to filter the records you retrieved in steps 1,2 and 3.

Using INTERSECT will return identical results but more efficiently. To see for yourself you can create the sample data below and run both queries:

use tempdb
go
if object_id('dbo.customers') is not null drop table dbo.customers;
if object_id('dbo.employees') is not null drop table dbo.employees;

create table dbo.customers
(
  customerId int identity,
  country    varchar(50),
  region     varchar(50),
  city       varchar(100)
);

create table dbo.employees
(
  employeeId int identity,
  country    varchar(50),
  region     varchar(50),
  city       varchar(100)
);

insert dbo.customers(country, region, city) 
values ('us', 'N/E', 'New York'), ('us', 'N/W', 'Seattle'),('us', 'Midwest', 'Chicago');
insert dbo.employees
values ('us', 'S/E', 'Miami'), ('us', 'N/W', 'Portland'),('us', 'Midwest', 'Chicago');

Run these queries:

SELECT D.Country, D.Region, D.City
FROM 
(
  SELECT DISTINCT Country, Region, City 
  FROM Customers
  UNION ALL
  SELECT DISTINCT Country, Region, City
  FROM Employees
) AS D
GROUP BY D.Country, D.Region, D.City
HAVING COUNT(*) = 2;

SELECT Country, Region, City
FROM dbo.customers
INTERSECT
SELECT Country, Region, City
FROM dbo.employees;

Results:

Country     Region     City
----------- ---------- ----------
us          Midwest    Chicago

Country     Region     City
----------- ---------- ----------
us          Midwest    Chicago

If using INTERSECT is not an option OR you want a faster query you could improve the query you posted a couple different ways, such as:

Option 1: let GROUP BY handle ALL the de-duplication like this:

This is the same as what you posted but without the DISTINCTS

SELECT D.Country, D.Region, D.City
FROM 
(
  SELECT Country, Region, City 
  FROM Customers
  UNION ALL
  SELECT Country, Region, City
  FROM Employees
) AS D
GROUP BY D.Country, D.Region, D.City
HAVING COUNT(*) = 2;

Option 2: Use ROW_NUMBER

This would be my preference and will likely be most efficient

SELECT Country, Region, City
FROM 
(
  SELECT
    rn = row_number() over (partition by D.Country, D.Region, D.City order by (SELECT null)), 
    D.Country, D.Region, D.City
  FROM 
  (
    SELECT Country, Region, City 
    FROM Customers
    UNION ALL
    SELECT Country, Region, City
    FROM Employees
  ) AS D
) uniquify
WHERE rn = 2;

Hi Alan, thank you for response. You completely right. – Milad Rashidi Oct 17 '17 at 06:06 — Milad Rashidi, Oct 17 '17 at 06:06

Using "UNION ALL" and "GROUP BY" to implement "Intersect"

1 Answers1