0
Database 1      Database 2
Material 1      Submaterial 1     Submaterial 2     
ABC             123               ABC
BCD             XYZ               234
DEF             456               DEF
XYZ             BCD               EFG
EFG             897               345
PQR             678               PQR

Please help me to check if column Material 1 from database 1 is existing in either Submaterial 1 and Submaterial 2 from database 2. Here Submaterial 1 & 2 can be merged to find the value.

Abdul Rasheed
  • 6,486
  • 4
  • 32
  • 48
venky ps
  • 25
  • 1
  • 5

3 Answers3

0

One approach is to left join the first table to the second table twice using each of the second table's two submaterial columns. Then check that a given material from the first table had a match in at least one of the second table's columns.

SELECT DISTINCT
    [Material 1]
FROM table1 t1
LEFT JOIN table2 t2
    ON t1.[Material 1] = t2.[Submaterial 1]
LEFT JOIN table2 t3
    ON t1.[Material 1] = t3.[Submaterial 2]
WHERE
    t2.[Submaterial 1] IS NOT NULL OR t3.[Submaterial 2] IS NOT NULL;

You may remove DISTINCT from the query if you don't expect that a material in the first table can appear more than once.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • The last sentence is not clear. It seems that it's sufficient having the uniqueness in T1, but actually it's necessary in T2 and T3. BTW, this solution is highly inefficient compared to IN or EXISTS – ildanny Jan 25 '18 at 10:52
  • @ildanny Because we need to check against two columns, `IN` or `EXISTS` might not be as efficient as you think. If you think you can do better, then post another answer with that query. Assuming there are indices for the join columns, then my answer should have reasonable performance. – Tim Biegeleisen Jan 25 '18 at 11:10
  • There're multiple reasons why, in this scenario, EXISTS is more efficient than a LEFT JOIN. Just an hint: with "exists", SQL Server scans T2 for a match, when it finds it, it stops scanning, with left you are obliging it to continue for other matches. If it finds them, you tell him to discard the result (distinct) so it's a double waste of resources. Let me few minutes to publish a full comparison. – ildanny Jan 25 '18 at 11:28
0

If both are in the same server and assuming three columns being in three different tables of the schema dbo, try

Select Material1 FROM Database1.dbo.Table1
WHERE Material1 IN
(
SELECT Submaterial1 MATERIAL FROM Database2.dbo.Table2
UNION
SELECT Submaterial2 FROM Database2.dbo.Table3
)
sanatsathyan
  • 1,713
  • 12
  • 18
0

Tim asked me to demonstrate that, in this scenario, "in+union" and "Exists" are more efficient than a "left join". I made several tests within this scenario. In my tests In/Exists were always better, but in one case where they are the same. The explanation is that you don't want to actually collect data from SubM1 or SubM2 but only to execute a test. With "exists", Sql Server will scan SubM1 for a match, if it finds it, it stops scanning. With a left join, it will continue to search other matches unless you define a unique index on SubM1 and SubM2. The more you are far from this scenario the more "exists" has a greater advantage.

I prepared this script: Union and exists have exactly the same execution plan so I omitted one. I generate about 923.521 in Material and about 850k rows in the other tables.

USE [Database1]
SET NOCOUNT ON;

IF object_ID('[dbo].[Material1]') is not null drop table [dbo].[Material1];
CREATE TABLE [dbo].[Material1]
(
    [MaterialCode] [varchar](20) NOT NULL PRIMARY KEY CLUSTERED
)
GO

insert into [Database1].[dbo].[Material1]( [MaterialCode] )
SELECT 
    (A+b+c+d) Code
from 
    (values ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('J'),('K'),('I'),('L'),('M'),('N'),('O'),('P'),('Q'),('R'),('S'),('T'),('U')) A(A)
    cross join 
    (values ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('J'),('K'),('I'),('L'),('M'),('N'),('O'),('P'),('Q'),('R'),('S'),('T'),('U')) b(b)
    cross join 
    (values ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('J'),('K'),('I'),('L'),('M'),('N'),('O'),('P'),('Q'),('R'),('S'),('T'),('U')) c(c)
    cross join 
    (values ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('J'),('K'),('I'),('L'),('M'),('N'),('O'),('P'),('Q'),('R'),('S'),('T'),('U')) d(d)
GO

use DATABASE2;
IF object_ID('[dbo].[Submaterial1]') is not null drop table [dbo].[Submaterial1];
IF object_ID('[dbo].[Submaterial2]') is not null drop table [dbo].[Submaterial2];
CREATE TABLE [dbo].[Submaterial1](
    [MaterialCode] [varchar](20) NULL
) 
CREATE index IX ON [dbo].[Submaterial1]([MaterialCode]);
GO

insert into DATABASE2.[dbo].[Submaterial1]( [MaterialCode] )
SELECT 
    (A+b+c+d) Code
from 
    (values ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('J'),('K'),('I'),('L'),('M'),('N'),('O'),('P'),('Q'),('R'),('S'),('T'),('U')) A(A)
    cross join 
    (values ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('J'),('K'),('I'),('L'),('M'),('N'),('O'),('P'),('Q'),('R'),('S'),('T'),('U')) b(b)
    cross join 
    (values ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('J'),('K'),('I'),('L'),('M'),('N'),('O'),('P'),('Q'),('R'),('S'),('T'),('U')) c(c)
    cross join 
    (values ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('J'),('K'),('I'),('L'),('M'),('N'),('O'),('P'),('Q'),('U')) d(d)
GO

CREATE TABLE [dbo].[Submaterial2](
    [MaterialCode] [varchar](20) NULL
) 
CREATE index IX ON [dbo].[Submaterial2]([MaterialCode]);

insert into DATABASE2.[dbo].[Submaterial2]( [MaterialCode] )
SELECT 
    (A+b+c+d) Code
from 
    (values ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('J'),('K'),('I'),('L'),('M'),('N'),('O'),('P'),('Q'),('R'),('S'),('T'),('U')) A(A)
    cross join 
    (values ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('J'),('K'),('I'),('L'),('M'),('N'),('O'),('P'),('Q'),('R'),('S'),('T'),('U')) b(b)
    cross join 
    (values ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('J'),('K'),('I'),('L'),('M'),('N'),('O'),('P'),('Q'),('R'),('S'),('T'),('U')) c(c)
    cross join 
    (values ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9'),('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('J'),('K'),('I'),('L'),('M'),('N'),('O'),('P'),('Q'),('R'),('S')) d(d)

GO

SET NOCOUNT ON;

GO

SET statistics IO ON;
SET statistics time ON;

SELECT distinct M.[MaterialCode] 
FROM Database1.[dbo].[Material1] M
    LEFT JOIN Database2.dbo.[Submaterial1] SM1
        ON M.MaterialCode = SM1.MaterialCode
    LEFT JOIN Database2.dbo.[Submaterial2] SM2
        ON M.MaterialCode = SM2.MaterialCode
WHERE 
    SM1.MaterialCode is not null OR SM2.MaterialCode is not null;
;
SELECT [MaterialCode] 
FROM Database1.[dbo].[Material1] M
WHERE 
    exists (SELECT * from Database2.dbo.[Submaterial1] SM1 where SM1.MaterialCode = M.MaterialCode)
    OR 
    exists (SELECT * from Database2.dbo.[Submaterial2] SM2 where SM2.MaterialCode = M.MaterialCode)

You can adjust the script to simulate multiple scenarios.

This script, executed as it is, uses this execution plan and you can see "exists" has a big advantage:
Execution plan 1

Increasing the number of duplicated rows in SubM1 and SubM2 the advantage is bigger and bigger: Execution plan 2

If you create a unique index on SubM1 and SubM2, Left join and Exists have the same performances. This result is expected and matches my explanation on why "exists" is better then a "left join": Execution plan 2

ildanny
  • 371
  • 2
  • 10
  • There are many factors not addressed here. For example, `MERGE` join is the fasted you can get over two ordered sets, while `NESTED LOOP` can scale poorly on large data sets. This means that with larger data-sets you may see completely different relative costs, or even a different execution plan, due to the difference in the statistics. – MatBailie Jan 25 '18 at 12:27
  • This SO question and the linked blog post more rigorously demonstrate the benefits of EXISTS *(though with some of the complexities in this particular case)*. One important note is that the blog doesn't claim anything to ***always*** be better than something else ;) https://stackoverflow.com/questions/14693118/does-except-execute-faster-than-a-join-when-the-table-columns-are-the-same/14693264#comment83803008_14693264 – MatBailie Jan 25 '18 at 12:27
  • @Mat: The merge and nested loops are not so relevant in this context. T1 has a clustered index and there're indexes on the other two tables as well. So all three data sets are sorted. Adding more data, SQLS will start using a merge join. I wrote that "exists" is better in this scenario not every possible scenario. Your link is about "EXCEPT". My post is about EXISTS, is it possible you linked the wrong post? – ildanny Jan 25 '18 at 14:03
  • If you go to the blog post linked in the accepted answer it compares `IN()`, `EXISTS()`, `EXCEPT`, `LEFT JOIN`, etc. – MatBailie Jan 25 '18 at 14:08
  • @MatBailie: No. He tests "not exists()" that is different. For example when using NOT IN() and NOT EXISTS(), you have the problem of null values. – ildanny Jan 25 '18 at 14:14
  • I agree with the consideration to `NULL`, outside of that, however, I would expect that the performance characteristics would be broadly equivalent. – MatBailie Jan 25 '18 at 15:05
  • I expect the same, but still a semi-join is not an anti-join. In other terms, I cannot understand the purpose of asking me to read an article about something different if you are the first to say that "you expect". It means that the article doesn't help clarifying the problem. At the end, my answer is not supposed to be the definitive guide about this. I only wanted to show Tim Biegeleisen that he was wrong thinking that his query is better than the other. So, exactly, what's you point? – ildanny Jan 25 '18 at 15:26
  • Read my first comment again then. This is not a rigorous comparison. The link was to demonstrate an example of a rigorous comparison. Although I agree with your conclusion I would not consider this answer to be sufficient evidence in and of itself *(especially as the first sentence describes the answer as an assertion about performance)*. – MatBailie Jan 25 '18 at 15:28
  • I already replied to your first message. I confirm that the merge join is not relevant. But I appreciated your suggestion to make a more rigorous comparison so I improved my post. I still think that the initial explanation was the correct way to read this problem, but if you have any suggestion, I will read it with pleasure. – ildanny Jan 25 '18 at 17:21