I've been tasked with creating a data synchronization process between a CSV file generated by another vendor and upwards of 300 separate but structurally identical CRM databases, all of which live on the same SQL Server instance. Here are the specifics:
The source data will be a CSV which contains a list of all email addresses where clients have opted-in to marketing communications. This CSV file will be sent in its entirety every night, but will contain record-level date/time stamps which will allow me to select only those records which have been modified since the last processing cycle. The CSV file will potentially have many hundreds of thousands of rows, though the expected changes on a daily basis will be substantially lower than that.
I'll be selecting the relevant rows from the CSV and converting each one into an element of a List<T> of a custom type.
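To make that concrete, here's roughly what I have in mind, assuming a simple three-column layout of email address, opt-in flag and last-modified timestamp. The OptInRecord type, the column order and the date format are all placeholders for whatever the vendor actually sends, and in practice I'd likely swap the plain Split for a proper CSV parser:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

// placeholder shape for one row of the vendor file
class OptInRecord
{
    public string EmailAddress { get; set; }
    public bool OptedIn { get; set; }
    public DateTime LastModifiedUtc { get; set; }
}

// read the nightly file and keep only the rows changed since the last run;
// delimiter, column order and formats are guesses at the vendor layout
static List<OptInRecord> LoadChangedRecords(string csvPath, DateTime lastProcessedUtc)
{
    return File.ReadLines(csvPath)
        .Skip(1)                              // assume a header row
        .Select(line => line.Split(','))      // assume no quoted/embedded commas
        .Select(cols => new OptInRecord
        {
            EmailAddress = cols[0].Trim(),
            OptedIn = bool.Parse(cols[1]),
            LastModifiedUtc = DateTime.Parse(cols[2], CultureInfo.InvariantCulture)
        })
        .Where(r => r.LastModifiedUtc > lastProcessedUtc)
        .ToList();
}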
Once the CSV has been queried and the data transformed, I need to compare the contents of this List<T> against the CRM databases, because any given email address in the CSV file may:
- Not exist in any of the 300 databases
- Exist in one of the 300 databases
- Exist in multiple databases
In any case where there is a match between an email address in the master CSV list and any CRM database, the matching CRM record will be updated with the values contained in the CSV file.
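Per match, I'm picturing an update along these lines (just a sketch that assumes using System.Data.SqlClient; the table and column names are placeholders for the real CRM schema, and OptInRecord is the row type from the snippet above):

// hypothetical per-match update; EMAIL_TABLE and its columns are placeholders
static void UpdateOptIn(SqlConnection connection, OptInRecord record)
{
    using (var cmd = connection.CreateCommand())
    {
        cmd.CommandText =
            "UPDATE EMAIL_TABLE " +
            "SET OPTED_IN = @optedIn, LAST_MODIFIED = @lastModified " +
            "WHERE EMAIL_ADDRESS = @emailAddress";

        cmd.Parameters.AddWithValue("@optedIn", record.OptedIn);
        cmd.Parameters.AddWithValue("@lastModified", record.LastModifiedUtc);
        cmd.Parameters.AddWithValue("@emailAddress", record.EmailAddress);

        cmd.ExecuteNonQuery();
    }
}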
At a high, very generic level, I was thinking that I would have to do something like this:
foreach (string dbName in masterDatabaseList)
{
    // open a connection to this CRM database
    foreach (string emailAddress in masterEmailList)
    {
        // helper method that would execute something like
        // "IF EXISTS (SELECT 1 FROM EMAIL_TABLE WHERE EMAIL_ADDRESS = @emailAddress)"
        // and return true if a row exists
        bool matchFound = EmailExistsInDb(dbName, emailAddress);
        if (matchFound)
        {
            // the current email from the master list exists in this database,
            // so run the necessary updates against the matching record
        }
    }
}
Is this the most efficient approach? I'm not too keen on having to hit 300 databases potentially thousands of times to see whether each and every email in the master CSV list exists. Ideally, I'd like to generate a SQL statement along the lines of:
"SELECT * FROM EMAIL_TABLE WHERE EMAIL_ADDRESS IN(email1,email2, email3,...)"
This would allow a single query to be executed against each database, but I don't know whether this approach would be any better or more efficient, especially because I would have to generate the SQL dynamically and could potentially open myself up to SQL injection.
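For what it's worth, if I did go the IN(...) route, I assume the safer version is to build the list out of parameters rather than concatenating the raw values. A rough sketch of what I mean (assumes using System.Data.SqlClient and System.Linq; the helper name and the EMAIL_TABLE schema are placeholders again):

// given an open connection and a batch of addresses, return the subset
// that already exists in this database's EMAIL_TABLE
static HashSet<string> FindExistingEmails(SqlConnection connection, IList<string> emailBatch)
{
    var matches = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    if (emailBatch.Count == 0) return matches;

    // build "@e0, @e1, ..." so every value is passed as a parameter
    var paramNames = emailBatch.Select((e, i) => "@e" + i).ToArray();

    using (var cmd = connection.CreateCommand())
    {
        cmd.CommandText =
            "SELECT EMAIL_ADDRESS FROM EMAIL_TABLE WHERE EMAIL_ADDRESS IN (" +
            string.Join(", ", paramNames) + ")";

        for (int i = 0; i < emailBatch.Count; i++)
        {
            cmd.Parameters.AddWithValue(paramNames[i], emailBatch[i]);
        }

        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                matches.Add(reader.GetString(0));
            }
        }
    }

    return matches;
}

I realize I'd also have to send the addresses in batches, since SQL Server caps a single command at roughly 2,100 parameters, but even batched that seems like far fewer round trips than one query per email.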
What is the best practice in this scenario? Because all 300 databases need to be compared on every run, I'm looking for the approach that gives the best results with the least processing time. In my production code I'll be using a multi-threaded approach so that multiple databases can be processed simultaneously, which means whatever approach I take needs to be thread-safe.
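For context, the threading I have in mind is roughly this (assumes using System.Threading.Tasks and System.Data.SqlClient; GetConnectionString and SyncDatabase are hypothetical helpers, and the master email list would be built once up front and treated as read-only so the workers never write to shared state):

// one database per worker; the degree of parallelism would be tuned for the server
Parallel.ForEach(
    masterDatabaseList,
    new ParallelOptions { MaxDegreeOfParallelism = 8 },
    dbName =>
    {
        using (var connection = new SqlConnection(GetConnectionString(dbName)))
        {
            connection.Open();
            SyncDatabase(connection, masterEmailList);
        }
    });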