Distinct and Inner join when millions of records exists

Question

I have my SQL Query is like this

 INSERT INTO staging.lps_data
    (
          col1
          ,col2
          ,col3
          ,col4
          ,col5

    )
            SELECT DISTINCT
                    col1
          ,col2
          ,col3
          ,col4
          ,col5
            FROM    tbl1 r WITH ( NOLOCK )
                    INNER JOIN tbl2 p WITH ( NOLOCK ) ON p.col1= r.col1
                    INNER JOIN tbl3 l WITH ( NOLOCK ) ON l.col2 = r.col2
        r.col1 NOT IN ( 'Foreclosure Deed',
                                           'Foreclosure Deed - Judicial',
                                           'Foreclosure RESPA',
                                           'Foreclosure Vendor Assignment Review',
                                           'Foreclosure Stop',
                                           'Foreclosure Screenprints Other',
                                           'Foreclosure Sale Audit',
                                           'Foreclosure Property Preservation',
                                           'Foreclosure Acquisition',
                                           'Foreclosure Notices Attorney Certification' )
                    AND ( r.col1  LIKE 'foreclosure%'
                          OR r.col1  = 'Vesting CT'
                        );

My tbl1 contains 100 million records , tbl2 contains 100 million records and tbl3 contains 1000million records. I gone thru the estimated execution plan The more load shows in Distinct. Note : I applied proper indexing on the tables.

I just try to solve this using batch process some thing like below

          INSERT INTO TEMP1
          SELECT SK_ID from tbl1 r where                        ( r.processname LIKE 'foreclosure%'                 OR r.processname = 'Vesting CT')
       EXCEPT
       SELECT SK_ID from tbl1 r where r.processname NOT IN ( 'Foreclosure Deed','Foreclosure Deed - Judicial',
                'Foreclosure RESPA',
                'Foreclosure Vendor Assignment Review',
                'Foreclosure Stop',
                'Foreclosure Screenprints Other',
                'Foreclosure Sale Audit',
                'Foreclosure Property Preservation',
                'Foreclosure Acquisition',
                'Foreclosure Notices Attorney Certification' )


               -- Load data into staging table in batch mode
              DECLARE @STARTID BIGINT=1, @LASTID BIGINT, @ENDID BIGINT;
              DECLARE @SPLITCONFIG BIGINT =1000 -- Process 1000 records as batch
              SELECT  @LASTID = MAX(ID)  FROM TEMP1


       WHILE @STARTID < @LASTID
       BEGIN
IF(@STARTID + @SPLITCONFIG > @LASTID)
    SET @ENDID = @LASTID
ELSE
    SET @ENDID = @STARTID + @SPLITCONFIG


    INSERT INTO staging.lps_data
               ( col1
          ,col2
          ,col3
          ,col4
          ,col5)
  SELECT DISTINCT
                 col1
          ,col2
          ,col3
          ,col4
          ,col5
            FROM    tbl1 r WITH (NOLOCK)
                    INNER JOIN TEMP1 SK WITH(NOLOCK) ON (r.SK_ID=SK.SK_ID AND SK.ID >=@STARTID AND SK.ID < @ENDID)
                    INNER JOIN tbl2 p WITH (NOLOCK) ON p.refinfoidentifier = r.refinfoidentifier 
                    INNER JOIN tbl3 l WITH (NOLOCK) ON l.loaninfoidentifier = r.loaninfoidentifier



    SET @STARTID = @ENDID
  END

With first approach my server got crashed with out of memeory, With second approach I could able to process complete records in 4 Hours.

Please suggest me if any thing else i can do to complete this process less than hour

Try increasing your batch size. Experiment with different values, but I would think the sweet spot will be somewhere around 50000. — Tab Alleman, Feb 11 '16 at 14:17
Exactly tried the same. But again memory usage getting increased and performance is poor than 1000. — santhosh padamatinti, Feb 11 '16 at 14:21
Could you post execution plan along with sample schema,can you also try creating filtered index on r.col1 predicates,but please test before hand — TheGameiswar, Feb 11 '16 at 14:22
First query is missing a WHERE? Stick all those constant strings in an indexed table for starters + NOT EXISTS. — Alex K., Feb 11 '16 at 14:22
Does one of the three tables contain *all* the columns col1 to col5? Are they unique in that table? — Thorsten Kettner, Feb 11 '16 at 14:51
yes col1, col2 is coming from tbl1 ,col3 is coming from tbl2 and col4, col5 is coming from tbl3. And tbl1-tbl3 is having one to many relation ship. — santhosh padamatinti, Feb 11 '16 at 15:15
Okay, to get this straight: Is col1+col2 unique in tbl1? Is col1+col3 unique in tbl2? Is col2+col4+col5 unique in tbl3? Or even subsets, just as col1 unique for tbl1 or col2 unique for tbl3? — Thorsten Kettner, Feb 11 '16 at 15:25
col1+col2 unique in tbl1, col1+col3 unique in tbl2 & col2+col4+col5 unique in tbl3 but in tbl3 for one col2 there will be multiple rows — santhosh padamatinti, Feb 11 '16 at 15:42

score 0 · Answer 1 · answered Feb 11 '16 at 14:20

Not sure what indexing you have in your table but try changing your SELECT with ROW_NUMBER() instead of using DISTINCT like

        SELECT 
       col1
      ,col2
      ,col3
      ,col4
      ,col5 FROM
      (
        SELECT 
       col1
      ,col2
      ,col3
      ,col4
      ,col5
      ,ROW_NUMBER() OVER(ORDER BY r.col1) as rn
        FROM    tbl1 r
                INNER JOIN tbl2 p ON p.col1= r.col1
                INNER JOIN tbl3 l ON l.col2 = r.col2
    WHERE
    r.col1 NOT IN ( 'Foreclosure Deed',
                                       'Foreclosure Deed - Judicial',
                                       'Foreclosure RESPA',
                                       'Foreclosure Vendor Assignment Review',
                                       'Foreclosure Stop',
                                       'Foreclosure Screenprints Other',
                                       'Foreclosure Sale Audit',
                                       'Foreclosure Property Preservation',
                                       'Foreclosure Acquisition',
                                       'Foreclosure Notices Attorney Certification' )
                AND ( r.col1  LIKE 'foreclosure%' OR r.col1  = 'Vesting CT')) xxx
      WHERE rn = 1;

score 0 · Answer 2 · edited Feb 02 '17 at 14:14

0

try inserting the strings of the NOT IN in a NEW_TABLE and left join it with the tbl1 filtering WHERE r.col1 IS NULL (best using an ID or integer instead of strings) or use r.col1 NOT EXISTS (SELECT 1 FROM NEW_TABLE WHERE ...)
Bye,
Igor

edited Feb 02 '17 at 14:14

Vao Tsun

47,234
13
100
132

answered Feb 11 '16 at 14:52

user1

556
2
6
22

score 0 · Answer 3 · answered Feb 11 '16 at 16:24

You are saying that col1+col2 is unique in tbl1. That means if we select these two columns from tbl1 alone we get no duplicates.
Then you use col1 to join tbl2. In tbl2 the pair col1+col3 is unique. So when we join we get no duplicates either. We get unique col1+col2+col3.
Then you use col2 to join tbl3. Here col2+col4+col5 are unique. So again no duplicates when we join. We get unique col1+col2+col3+col4+col5.

So what is the DISTINCT for when there cannot be duplicates? Remove it and your query should be way faster.

As you say the columns are unique in the tables, there will certainly be indexes on them, so it is very likely the tables themselves don't get read, but only the indexes as they contain all required data already. This is as good as it can get. I see no means for optimization here.

(Of course with large tables and indexes one can always think about partitioning to get data access faster.)

Distinct and Inner join when millions of records exists

3 Answers3