Inserting millions of records with deduplication SQL

Question

This is a theoretical scenario, and I am more than amateur when it comes to large scale SQL databases...

How would I go about inserting around 2million records into an existing database off 6million records (table1 into table2), whilst at the same time using email de-duplication (some subscribers may already exist in site2, but we don't want to insert those that already exist)?

I understand how to simply get the records from site 1 and add them into site 2, but how would we do this on such a large scale, and not causing data duplication? Any reading sources would be more than helpful for me, as ive found that a struggle.

i.e.: Table 1: site1Subscribers

site1Subscribers(subID, subName, subEmail, subDob, subRegDate, subEmailListNum, subThirdParties)

Table 2: site2Subscribers

site2Subscribers(subID, subName, subEmail, subDob, subRegDate, subEmailListNum, subThirdParties)

Which DBMS are you using? Oracle? PostgreSQL? DB2? – Dec 12 '12 at 11:38 — , Dec 12 '12 at 11:38
My set-up is a typical LAMP; mySQL under Linux Ubuntu (10). – kirgy Dec 12 '12 at 12:04 — kirgy, Dec 12 '12 at 12:04

score 3 · Accepted Answer · answered Dec 12 '12 at 11:15

3

I would try something like this:

insert into site2Subscribers
select * from site1Subscribers s1
left outer join site2Subscribers s2
    on s1.subEmail = s2.subEmail
where s2.subEmail is null;

The left outer join along with the null check will return only those rows from site1Subscribers that have no matching entry in site2Subscribers.

answered Dec 12 '12 at 11:15

Daniel Hilgarth

171,043
40
335
443

Better yet, use `EXISTS` which translates into a slightly cheaper semi-join. – usr Dec 12 '12 at 11:16
@usr: Have you any sources for that? I always wondered what to use. – Daniel Hilgarth Dec 12 '12 at 11:17
I just noticed the cost difference in the execution plans. Maybe there isn't even a runtime difference, but why would the cost formulae differ if the runtime cost was identical? I should benchmark this... – usr Dec 12 '12 at 11:18

Inserting millions of records with deduplication SQL

1 Answers1