SQL Server Comparing 2 tables having Millions of Records

Question

I have 2 tables in SQL Server: Table 1 and Table 2.
Table 1 has 500 Records and Table 2 has Millions of Records.
Table 2 may/may not have the 500 Records of Table 1 in it.

I have to compare Table 1 and Table 2. But the result should give me only the Records of Table 1 which has any data change in Table 2. Means the Result should be less than or equal to 500.

I don't have any primary key but the columns in the 2 tables are same. I have written the following query. But I am getting time out exception and it is taking much time to process. Please help.

With CTE_DUPLICATE(OLD_FIRSTNAME ,New_FirstName,      
    OLD_LASTNAME ,New_LastName,      
    OLD_MINAME ,New_MIName ,      
    OLD_FAMILYID,NEW_FAMILYID,ROWNUMBER)      
as (        
 Select distinct      

    OLD.FIRST_NAME AS 'OLD_FIRSTNAME' ,New.First_Name  AS 'NEW_FIRSTNAME',      
    OLD.LAST_NAME AS 'OLD_LASTNAME',New.Last_Name AS 'NEW_LASTNAME',      
    OLD.MI_NAME AS 'OLD_MINAME',New.MI_Name AS 'NEW_MINAME',      
    OLD.FAMILY_ID AS 'OLD_FAMILYID',NEW.FAMILY_ID AS 'NEW_FAMILYID',      
    row_number()over(partition by OLD.FIRST_NAME ,New.First_Name,      
    OLD.LAST_NAME ,New.Last_Name,      
    OLD.MI_NAME ,New.MI_Name ,      
    OLD.FAMILY_ID,NEW.FAMILY_ID       
    order by OLD.FIRST_NAME ,New.First_Name,      
    OLD.LAST_NAME ,New.Last_Name,      
    OLD.MI_NAME ,New.MI_Name ,      
    OLD.FAMILY_ID,NEW.FAMILY_ID )as rank       
 From EEMSCDBStatic OLD,EEMS_VIPFILE New where            
 OLD.MPID <> New.MPID    and old.FIRST_NAME <> New.First_Name  
 and OLD.LAST_NAME <> New.Last_Name and OLD.MI_NAME <> New.MI_Name   
 and old.Family_Id<>New.Family_id    
 )      
 sELECT OLD_FIRSTNAME ,New_FirstName,      
    OLD_LASTNAME ,New_LastName,      
    OLD_MINAME ,New_MIName ,      
    OLD_FAMILYID,NEW_FAMILYID FROM CTE_DUPLICATE where rownumber=1

You can make with `JOIN` (if you don't have keys the `JOIN` will be slowest). But I think that comparing only one field will be faster... and with `CASE` or `IF` if the comparision fail (distinct values) proceed with the second field and then the third and then ... Worst case will be 500 rows and all the fields, but conditional. — José M. Carnero, Jun 27 '16 at 06:23
Why do you write `Table 1` instead of `EEMSCDBStatic`? `Table 1` = `EEMSCDBStatic` Any indexes on that tables? — Vojtěch Dohnal, Jun 27 '16 at 06:37
What do you mean with *which has any data change*? If there is no unique key to tell which record corresponds with which record in the other table how would you know if a *Michael Smith* in both tables is the same person? And do you want to return the rows which exist in both tables or the one *which has any data change*? What exactly is a *data change* in your context? — Shnugo, Jun 27 '16 at 06:39
Also you are using <> condition, which will join each row of your table 1 with millions of rows of table 2 (as there only one match possible). So you will be getting millions of rows instead of 'Result should be less than or equal to 500' — an33sh, Jun 27 '16 at 06:44

score 0 · Answer 1 · answered Jun 27 '16 at 06:46

I think the main problem here is that your query is forcing the DB to fully multiply your tables, which means processing ~500M combinations. It happens because you're connecting any record from T1 with any record from T2 that has at least one different value, including MPID that looks like the unique identifier that must be used to connect records.

If MPID is really the column that identifies records in both tables then your query should have a bit different structure:

SELECT old.FIRSTNAME, new.FirstName,
       old.LASTNAME, new.LastName,
       old.MINAME, new.MIName,
       old.FAMILYID, new.FAMILYID
    FROM EEMSCDBStatic old
         INNER JOIN EEMS_VIPFILE new ON old.MPID = new.MPID
    WHERE old.FIRST_NAME <> New.First_Name  
          AND OLD.LAST_NAME <> New.Last_Name
          AND OLD.MI_NAME <> New.MI_Name   
          AND old.Family_Id <> New.Family_id
    ORDER BY old.FIRSTNAME, new.FirstName,
             old.LASTNAME, new.LastName,
             old.MINAME, new.MIName,
             old.FAMILYID, new.FAMILYID

A couple of other thoughts:

If you're looking for any change in a record (even if only one column has different values), you should use ORs in the WHERE clause, not ANDs. Now you're only looking for records that changed values in all columns. For instance, you'll fail to find a person who changed his or her first name but decided to keep last name.
You should obviously consider indexing your tables if it's possible.

score 0 · Answer 2 · edited May 23 '17 at 11:52

Surely it is pointless to use DISTINCT keyword together with ROWNUMBER.

See this sql query distinct with Row_Number.

You are doing CROSS JOIN, which is terribly big in your case.

Perhaps in that condition you

where   OLD.MPID <> New.MPID and old.FIRST_NAME <> New.First_Name and ...

you wanted to have OR instead of AND?

It is also not entirely clear why you use ROWNUMBER at all - perhaps to find the best match.

All this is because as @Shnugo correctly remarked, the logic behind your comparing is faulty - you must have some logic defined that would JOIN the tables (Like First and second name must be the same).

SQL Server Comparing 2 tables having Millions of Records

2 Answers2