I've been tasked with creating a data synchronization process between a CSV file generated by another vendor and upwards of 300 separate but structurally identical CRM databases, all of which live on the same SQL Server instance. Here are the specifics:
The source data will be a CSV which contains a list of all email addresses where clients have opted-in to marketing communications. This CSV file will be sent in its entirety every night, but will contain record-level date/time stamps which will allow me to select only those records which have been modified since the last processing cycle. The CSV file will potentially have many hundreds of thousands of rows, though the expected changes on a daily basis will be substantially lower than that.
I'll be selecting the relevant rows from the CSV and converting each one into an element of a List<T> of a custom type.
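To make that concrete, here's roughly what I have in mind, assuming a simple three-column layout of email address, opt-in flag and last-modified timestamp. The OptInRecord type, the column order and the date format are all placeholders for whatever the vendor actually sends, and in practice I'd likely swap the plain Split for a proper CSV parser:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

// placeholder shape for one row of the vendor file
class OptInRecord
{
    public string EmailAddress { get; set; }
    public bool OptedIn { get; set; }
    public DateTime LastModifiedUtc { get; set; }
}

// read the nightly file and keep only the rows changed since the last run;
// delimiter, column order and formats are guesses at the vendor layout
static List<OptInRecord> LoadChangedRecords(string csvPath, DateTime lastProcessedUtc)
{
    return File.ReadLines(csvPath)
        .Skip(1)                              // assume a header row
        .Select(line => line.Split(','))      // assume no quoted/embedded commas
        .Select(cols => new OptInRecord
        {
            EmailAddress = cols[0].Trim(),
            OptedIn = bool.Parse(cols[1]),
            LastModifiedUtc = DateTime.Parse(cols[2], CultureInfo.InvariantCulture)
        })
        .Where(r => r.LastModifiedUtc > lastProcessedUtc)
        .ToList();
}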
Once the CSV has been queried and the data transformed, I need to compare the contents of this List<T> against the CRM databases, because any given email address in the CSV file may:
- Not exist in any of the 300 databases
- Exist in one of the 300 databases
- Exist in multiple databases
In any case where there is a match between an email address in the master CSV list and any CRM database, the matching CRM record will be updated with the values contained in the CSV file.
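Per match, I'm picturing an update along these lines (just a sketch that assumes using System.Data.SqlClient; the table and column names are placeholders for the real CRM schema, and OptInRecord is the row type from the snippet above):

// hypothetical per-match update; EMAIL_TABLE and its columns are placeholders
static void UpdateOptIn(SqlConnection connection, OptInRecord record)
{
    using (var cmd = connection.CreateCommand())
    {
        cmd.CommandText =
            "UPDATE EMAIL_TABLE " +
            "SET OPTED_IN = @optedIn, LAST_MODIFIED = @lastModified " +
            "WHERE EMAIL_ADDRESS = @emailAddress";

        cmd.Parameters.AddWithValue("@optedIn", record.OptedIn);
        cmd.Parameters.AddWithValue("@lastModified", record.LastModifiedUtc);
        cmd.Parameters.AddWithValue("@emailAddress", record.EmailAddress);

        cmd.ExecuteNonQuery();
    }
}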
At a high, very generic level, I was thinking that I would have to do something like this:
foreach (string dbName in masterDatabaseList)
{
    // open a connection to this CRM database
    foreach (string emailAddress in masterEmailList)
    {
        // helper method that would execute something like
        // "IF EXISTS (SELECT 1 FROM EMAIL_TABLE WHERE EMAIL_ADDRESS = @emailAddress)"
        // and return true if a row exists
        bool matchFound = EmailExistsInDb(dbName, emailAddress);
        if (matchFound)
        {
            // the current email from the master list exists in this database,
            // so run the necessary updates against the matching record
        }
    }
}
Is this the most efficient approach? I'm not too keen on having to hit 300 databases potentially thousands of times to see whether each and every email in the master CSV list exists. Ideally, I'd like to generate a SQL statement along the lines of:
"SELECT * FROM EMAIL_TABLE WHERE EMAIL_ADDRESS IN(email1,email2, email3,...)"
This would allow a single query to be executed against each database, but I don't know whether this approach would be any better or more efficient, especially because I would have to generate the SQL dynamically and could potentially open myself up to SQL injection.
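For what it's worth, if I did go the IN(...) route, I assume the safer version is to build the list out of parameters rather than concatenating the raw values. A rough sketch of what I mean (assumes using System.Data.SqlClient and System.Linq; the helper name and the EMAIL_TABLE schema are placeholders again):

// given an open connection and a batch of addresses, return the subset
// that already exists in this database's EMAIL_TABLE
static HashSet<string> FindExistingEmails(SqlConnection connection, IList<string> emailBatch)
{
    var matches = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    if (emailBatch.Count == 0) return matches;

    // build "@e0, @e1, ..." so every value is passed as a parameter
    var paramNames = emailBatch.Select((e, i) => "@e" + i).ToArray();

    using (var cmd = connection.CreateCommand())
    {
        cmd.CommandText =
            "SELECT EMAIL_ADDRESS FROM EMAIL_TABLE WHERE EMAIL_ADDRESS IN (" +
            string.Join(", ", paramNames) + ")";

        for (int i = 0; i < emailBatch.Count; i++)
        {
            cmd.Parameters.AddWithValue(paramNames[i], emailBatch[i]);
        }

        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                matches.Add(reader.GetString(0));
            }
        }
    }

    return matches;
}

I realize I'd also have to send the addresses in batches, since SQL Server caps a single command at roughly 2,100 parameters, but even batched that seems like far fewer round trips than one query per email.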
What is the best practice in this scenario? Because all 300 databases need to be compared on every run, I'm looking for the approach that gives the best results with the least processing time. In my production code I'll be using a multi-threaded approach so that multiple databases can be processed simultaneously, which means whatever approach I take needs to be thread-safe.
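For context, the threading I have in mind is roughly this (assumes using System.Threading.Tasks and System.Data.SqlClient; GetConnectionString and SyncDatabase are hypothetical helpers, and the master email list would be built once up front and treated as read-only so the workers never write to shared state):

// one database per worker; the degree of parallelism would be tuned for the server
Parallel.ForEach(
    masterDatabaseList,
    new ParallelOptions { MaxDegreeOfParallelism = 8 },
    dbName =>
    {
        using (var connection = new SqlConnection(GetConnectionString(dbName)))
        {
            connection.Open();
            SyncDatabase(connection, masterEmailList);
        }
    });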