
I need to synchronize tables between 2 databases daily; the source is MSSQL 2008 and the target is MSSQL 2005. If I use UPDATE, INSERT, and DELETE statements (i.e. UPDATE rows that changed, INSERT new rows, DELETE rows that are no longer present), will there be a performance improvement if I run the DELETE statement first? I.e., will it keep the UPDATE statement from looking at rows that don't need to be updated, because they are about to be deleted?

Here are some other things I need to consider. The tables have 1-3 million+ rows, and because of the volume of transactions and business requirements, the source DB needs to remain online, and the query needs to be as efficient as possible. The job will run daily as a SQL Server Agent job on the target DB. On top of that, I am a DB rookie!

Thanks StackOverflow community, you are awesome!

Steve H
  • I would say DELETE then UPDATE then INSERT, but not necessarily for performance reasons. (1) a delete will *potentially* free up space that could later be used by an insert (2) if you do the insert first, you *might* update those rows unnecessarily if you perform the update after. – Aaron Bertrand Oct 02 '13 at 18:51
  • Thanks Aaron Bertrand! – Steve H Oct 02 '13 at 19:05

3 Answers


I'd say first you do the DELETE, then the UPDATE, then the INSERT, so you don't have to update rows which will be deleted anyway, and you won't update rows which were just inserted.
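For illustration, here's a minimal sketch of that order, assuming the job runs on the 2005 target and reaches the 2008 source through a linked server (the server, table, and column names are placeholders, not from the question):

```sql
-- Hypothetical names: [SourceServer] is a linked server to the 2008 instance,
-- dbo.Orders is the table being synchronized, and OrderID is its key.

-- 1) DELETE rows that no longer exist in the source
DELETE t
FROM dbo.Orders AS t
WHERE NOT EXISTS (
    SELECT 1
    FROM [SourceServer].SourceDb.dbo.Orders AS s
    WHERE s.OrderID = t.OrderID
);

-- 2) UPDATE rows that changed
UPDATE t
SET t.Amount = s.Amount,
    t.Status = s.Status
FROM dbo.Orders AS t
JOIN [SourceServer].SourceDb.dbo.Orders AS s
    ON s.OrderID = t.OrderID
WHERE t.Amount <> s.Amount
   OR t.Status <> s.Status;

-- 3) INSERT rows that are new in the source
INSERT INTO dbo.Orders (OrderID, Amount, Status)
SELECT s.OrderID, s.Amount, s.Status
FROM [SourceServer].SourceDb.dbo.Orders AS s
WHERE NOT EXISTS (
    SELECT 1 FROM dbo.Orders AS t WHERE t.OrderID = s.OrderID
);
```

With millions of rows over a linked server, it is usually faster to pull the source rows into a local staging table first and compare locally; also note that `<>` comparisons need extra care if the compared columns are nullable.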

But actually, have you seen SQL Server's MERGE syntax? It could save you a great amount of code.

Update: I have not checked the performance of the MERGE statement against separate INSERT/UPDATE/DELETE statements; see the [related link](http://dba.stackexchange.com/questions/44067/can-i-optimize-this-merge-statement/44069#44069) given by Aaron Bertrand for more details.
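For reference, a minimal sketch of what the MERGE syntax looks like, assuming the source rows have been staged locally first (all names are hypothetical; and as the comments below point out, MERGE requires SQL Server 2008+, so it is not available on the 2005 target):

```sql
-- Hypothetical sketch: dbo.Orders_Staging holds a copy of the source rows.
MERGE dbo.Orders AS t
USING dbo.Orders_Staging AS s
    ON t.OrderID = s.OrderID
WHEN MATCHED AND (t.Amount <> s.Amount OR t.Status <> s.Status) THEN
    UPDATE SET t.Amount = s.Amount, t.Status = s.Status
WHEN NOT MATCHED BY TARGET THEN
    INSERT (OrderID, Amount, Status) VALUES (s.OrderID, s.Amount, s.Status)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE; -- a MERGE statement must end with a semicolon
```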

Roman Pekar
  • I recommend against MERGE. Total character count of your code may be shorter, but [lots of potential pitfalls](http://dba.stackexchange.com/questions/44067/can-i-optimize-this-merge-statement/44069#44069). – Aaron Bertrand Oct 02 '13 at 18:53
  • @AaronBertrand thanks, just after I said merge, I've realized that I've never checked performance, so have to edit the answer AND read your link – Roman Pekar Oct 02 '13 at 18:54
  • @AaronBertrand I second your feeling on the reliability of MERGE. On the other hand, the chance is very high that you either don't hit any bugs or discover them during testing. We are now 5 years after its introduction; I hope that the common bugs have been weeded out, and that the original development team has been disbanded. I think the fact that MERGE removes some triple redundancy from the OP's syncing code removes more errors than the bugs introduce. – usr Oct 02 '13 at 19:00
  • I think since the job will be running on the target machine which is 2005 the merge is not an option. http://stackoverflow.com/questions/12621241/can-i-use-the-merge-statement-in-sql-server-2005 – jones6 Oct 02 '13 at 19:04
  • @usr Disagree, the "triple redundancy" involves patterns that have been tried and true since SQL Server 6.x, and not all of the bugs (or performance deficiencies in some cases) are incorrect results bugs that will necessarily be obvious during testing. I know of at least 15 bugs that are either still active or closed as by design / won't fix and I have written a cautionary article about this that will publish soon. – Aaron Bertrand Oct 02 '13 at 19:05
  • No, the MERGE statement is not supported in MSSQL Server 2005. But prior to this, I did pull data from 2005 -> 2008 using MERGE (run from the 2008 server). It seemed to work OK, but that's a fair warning; I will look into the reliability issues. Thanks Roman Pekar! – Steve H Oct 02 '13 at 19:08
  • BTW, I have used MERGE for some time now and it works without issue. Also, another reason for doing it in this order is in case you have FK constraints. But, that opens a whole other can of worms. – Steve Oct 02 '13 at 19:09
  • UPDATE before DELETE when foreign key constraints are present; otherwise, you may have referential integrity errors. You may not be able to delete tuples that are being referenced in other tables, unless you first update those tables and orphan the referenced tuple. Alternatively, you may disable all constraints temporarily to overcome this scenario. – Jorge Garcia Feb 19 '16 at 22:27
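To illustrate the constraint-disabling workaround from the last comment, a rough sketch (the table name is hypothetical; re-enabling WITH CHECK revalidates the existing rows so the constraints stay trusted):

```sql
-- Temporarily skip FK validation on the child table during the sync.
ALTER TABLE dbo.OrderLines NOCHECK CONSTRAINT ALL;

-- ... run the DELETE / UPDATE / INSERT statements here ...

-- Re-enable and revalidate; the doubled CHECK keyword is intentional.
ALTER TABLE dbo.OrderLines WITH CHECK CHECK CONSTRAINT ALL;
```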

Rule of Thumb: DELETE, then UPDATE, then INSERT.

Performance aside, my main concern is to avoid any potential Deadlocks when:

  1. Updating something you will immediately Delete.
  2. Inserting something you may immediately try to Update.

If you only modify what is necessary and use transactions correctly, then you may use any order.
P.S. Someone suggested using MERGE - I've tried it a few times and my preference is to never use it.
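As one hedged illustration of "use transactions correctly" (a sketch, not something from this answer): wrapping the whole sync in a single transaction means a failure cannot leave the target half-synced:

```sql
-- With XACT_ABORT ON, any run-time error rolls back the entire transaction.
SET XACT_ABORT ON;

BEGIN TRANSACTION;

    -- DELETE ...;
    -- UPDATE ...;
    -- INSERT ...;

COMMIT TRANSACTION;
```

Holding one large transaction across millions of rows has its own locking and log-growth costs, so batching the work is a common compromise.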

MikeTeeVee

I think Roman's answer is what you were looking for in your current situation: DELETE, UPDATE, INSERT (or MERGE).

Now there are other possible routes which can make things even faster, but with a rather different process:

1. Consider saving all statements to a file that you run against the target once in a while

Assuming both databases start out identical, for each SQL statement that modifies the 2008 database, save that statement to a .sql file which you later execute against the 2005 database. You have to consider locking the file while writing to it, and you may want some kind of redundancy. The advantage is that you need no access to the 2008 database at all while applying the changes to the 2005 database; in other words, there is no impact on the 2008 database's speed.

Pitfall: you may miss a statement and the destination will not be an exact equivalent...
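As a rough illustration (the file name and statements are invented), a captured file might look like the block below; it could then be replayed against the 2005 server with a tool such as sqlcmd, e.g. `sqlcmd -S TargetServer -d TargetDb -i sync.sql`:

```sql
-- sync.sql: hypothetical statements captured from the 2008 side
UPDATE dbo.Orders SET Status = 'Shipped' WHERE OrderID = 1001;
INSERT INTO dbo.Orders (OrderID, Amount, Status) VALUES (1002, 49.95, 'New');
DELETE FROM dbo.Orders WHERE OrderID = 998;
GO
```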

2. Ongoing replication

I do not know MSSQL well enough to recommend a specific tool for automatic replication (see here: http://technet.microsoft.com/en-us/library/ms151198.aspx), but I'd bet you can find a good one. MySQL (http://dev.mysql.com/doc/refman/5.0/en/replication.html) and PostgreSQL (http://wiki.postgresql.org/wiki/Streaming_Replication) have such tools, and they are all free.

This is the solution I would choose. Depending on the tool you use, it can be very well optimized, meaning the impact on the live system will be minimal and the 2005 copy will be up to date within seconds (depending on whether it is a long-distance remote connection, the amount of work, the setup of each server, the Internet connection, etc.)

The pitfall is obviously that it adds an ongoing process to the database, but if you find an MSSQL tool that works like PostgreSQL's streaming replication, it reads a copy of the transaction log, which means it is very fast (no heavy additional disk I/O).
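For the MSSQL route, transactional replication from a 2008 publisher to a 2005 subscriber is the closest built-in equivalent. A deliberately incomplete sketch of the publisher-side setup (all names are hypothetical, and a real setup also needs a distributor, snapshot agent, and security configuration):

```sql
-- Hypothetical, heavily simplified publisher-side setup on the 2008 server.
EXEC sp_replicationdboption
    @dbname = N'SourceDb', @optname = N'publish', @value = N'true';

EXEC sp_addpublication @publication = N'SyncPub', @status = N'active';

EXEC sp_addarticle
    @publication = N'SyncPub', @article = N'Orders',
    @source_object = N'Orders', @source_owner = N'dbo';

EXEC sp_addsubscription
    @publication = N'SyncPub', @subscriber = N'TargetServer',
    @destination_db = N'TargetDb', @subscription_type = N'push';
```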

3. Cluster Database (like Cassandra)

This would involve a change of database, which I'm sure you're not ready to do (especially because most of those systems do not offer SQL), but I thought it would be worth mentioning in your situation.

A system like Cassandra (http://cassandra.apache.org/) automatically replicates its data across many computers. It can be set up to replicate 100% of the data, or X% per computer, with redundancy in case of failure (a machine that breaks down). This removes the need for a dedicated copy on a separate computer, because performance can be increased simply by adding a few nodes to the system. (At less than $1,000 per machine, it is worth it! Frankly, you could build a petabyte system for $50k or less and end up with something a lot faster than any SQL database...)

The main problem is that using such a cluster is completely different from SQL. But it can be a solution for big businesses with large databases that need to be really fast and do not want to invest in a minicomputer (think COBOL and $250k machines that manage 100 million rows in a few milliseconds...)

With Cassandra you can run extremely heavy batch processes on back-end computers without making a dent in the front-end system!

Alexis Wilke
  • 19,179
  • 10
  • 84
  • 156
  • Thanks, that is very helpful. I had begun looking at the first route, saving to a file. In hindsight, I believe this would have been the ideal choice given the circumstances. I really only need to update once a day, so some differences are allowable. – Steve H Oct 02 '13 at 20:55
  • As for the second route you mentioned, we did discuss this, but it was decided that ongoing activity was not desirable since the load on the DB fluctuates. – Steve H Oct 02 '13 at 20:56
  • The third option is something I will look into. Looks like an attractive option, and something that I will discuss with the developers. Thank you, Alexis. – Steve H Oct 02 '13 at 20:58