3

This is a theoretical scenario, and I am more than amateur when it comes to large scale SQL databases...

How would I go about inserting around 2million records into an existing database off 6million records (table1 into table2), whilst at the same time using email de-duplication (some subscribers may already exist in site2, but we don't want to insert those that already exist)?

I understand how to simply get the records from site 1 and add them into site 2, but how would we do this on such a large scale, and not causing data duplication? Any reading sources would be more than helpful for me, as ive found that a struggle.

i.e.: Table 1: site1Subscribers

site1Subscribers(subID, subName, subEmail, subDob, subRegDate, subEmailListNum, subThirdParties)

Table 2: site2Subscribers

site2Subscribers(subID, subName, subEmail, subDob, subRegDate, subEmailListNum, subThirdParties)

kirgy
  • 1,567
  • 6
  • 23
  • 39

1 Answers1

3

I would try something like this:

insert into site2Subscribers
select * from site1Subscribers s1
left outer join site2Subscribers s2
    on s1.subEmail = s2.subEmail
where s2.subEmail is null;

The left outer join along with the null check will return only those rows from site1Subscribers that have no matching entry in site2Subscribers.

Daniel Hilgarth
  • 171,043
  • 40
  • 335
  • 443
  • Better yet, use `EXISTS` which translates into a slightly cheaper semi-join. – usr Dec 12 '12 at 11:16
  • @usr: Have you any sources for that? I always wondered what to use. – Daniel Hilgarth Dec 12 '12 at 11:17
  • I just noticed the cost difference in the execution plans. Maybe there isn't even a runtime difference, but why would the cost formulae differ if the runtime cost was identical? I should benchmark this... – usr Dec 12 '12 at 11:18