3

I'v provided following query to find common records in 2 data sets but it's difficult for me to make sure about correctness of my query because of that I have a lot of data records in my DB.

Is it OK to implement Intersect between "Customers" & "Employees" tables using UNION ALL and apply GROUP BY on the result like below?

SELECT D.Country, D.Region, D.City
  FROM (SELECT DISTINCT Country, Region, City 
          FROM Customers
         UNION ALL
        SELECT DISTINCT Country, Region, City
          FROM Employees) AS D
GROUP BY D.Country, D.Region, D.City
HAVING COUNT(*) = 2;

So can we say that any record which exists in the result of this query also exists in the Intersect set between "Customers & Employees" tables AND any record that exists in Intersect set between "Customers & Employees" tables will be in the result of this query too?

Milad Rashidi
  • 1,296
  • 4
  • 22
  • 40

1 Answers1

2

So is it right to say any record in result of this query is in "Intersect" set between "Customers & Employees" "AND" any record that exist in "Intersect" set between "Customers & Employees" is in result of this query too?

YES.

... Yes, but it won't be as efficient because you are filtering out duplicates three times instead of once. In your query you're

  1. Using DISTINCT to pull unique records from employees
  2. Using DISTINCT to pull unique records from customers
  3. Combining both queries using UNION ALL
  4. Using GROUP BY in your outer query to to filter the records you retrieved in steps 1,2 and 3.

Using INTERSECT will return identical results but more efficiently. To see for yourself you can create the sample data below and run both queries:

use tempdb
go
if object_id('dbo.customers') is not null drop table dbo.customers;
if object_id('dbo.employees') is not null drop table dbo.employees;

create table dbo.customers
(
  customerId int identity,
  country    varchar(50),
  region     varchar(50),
  city       varchar(100)
);

create table dbo.employees
(
  employeeId int identity,
  country    varchar(50),
  region     varchar(50),
  city       varchar(100)
);

insert dbo.customers(country, region, city) 
values ('us', 'N/E', 'New York'), ('us', 'N/W', 'Seattle'),('us', 'Midwest', 'Chicago');
insert dbo.employees
values ('us', 'S/E', 'Miami'), ('us', 'N/W', 'Portland'),('us', 'Midwest', 'Chicago');

Run these queries:

SELECT D.Country, D.Region, D.City
FROM 
(
  SELECT DISTINCT Country, Region, City 
  FROM Customers
  UNION ALL
  SELECT DISTINCT Country, Region, City
  FROM Employees
) AS D
GROUP BY D.Country, D.Region, D.City
HAVING COUNT(*) = 2;

SELECT Country, Region, City
FROM dbo.customers
INTERSECT
SELECT Country, Region, City
FROM dbo.employees;

Results:

Country     Region     City
----------- ---------- ----------
us          Midwest    Chicago

Country     Region     City
----------- ---------- ----------
us          Midwest    Chicago

If using INTERSECT is not an option OR you want a faster query you could improve the query you posted a couple different ways, such as:

Option 1: let GROUP BY handle ALL the de-duplication like this:

This is the same as what you posted but without the DISTINCTS

SELECT D.Country, D.Region, D.City
FROM 
(
  SELECT Country, Region, City 
  FROM Customers
  UNION ALL
  SELECT Country, Region, City
  FROM Employees
) AS D
GROUP BY D.Country, D.Region, D.City
HAVING COUNT(*) = 2;

Option 2: Use ROW_NUMBER

This would be my preference and will likely be most efficient

SELECT Country, Region, City
FROM 
(
  SELECT
    rn = row_number() over (partition by D.Country, D.Region, D.City order by (SELECT null)), 
    D.Country, D.Region, D.City
  FROM 
  (
    SELECT Country, Region, City 
    FROM Customers
    UNION ALL
    SELECT Country, Region, City
    FROM Employees
  ) AS D
) uniquify
WHERE rn = 2;
Alan Burstein
  • 7,770
  • 1
  • 15
  • 18