
We've been using SQL Server merge replication for a few years to synchronise data between our data centres, but we are now suffering from a serious performance issue. This may be because the amount of data we synchronise has increased considerably this year.

Our publisher is an always-on data centre in the UK. Our subscriber is a mobile data centre that travels around the world and is powered on for periods of up to a week at a time, roughly 25 times a year. However, it also spends the same amount of time (if not more) switched off whilst on its travels - it is a well-travelled data centre!

We have 5 databases that we synchronise between these servers. One of them sees a high volume of data changes between periods of subscriber downtime, and our issue is that it takes days to catch up when the server is powered up - the other databases are fine.

Downloads from publisher to subscriber run at about 1.5 rows per second (painful when we have hundreds of thousands of rows to move), yet strangely uploads from subscriber to publisher run about ten times faster.

Things I have checked / tried:
  • all tables have non-clustered primary keys on GUID columns with the ROWGUIDCOL property set
  • changing the generation levelling threshold doesn't help (see the example after this list)
  • setting the agent profile to high volume doesn't help
  • running a trace at the publisher and subscriber shows the queries all run very fast (generally under 20 ms, though there are gaps of 200 ms or so between some batches of queries)
  • analysis of our WAN link shows we have huge amounts of bandwidth spare
  • analysis of our servers shows we have huge amounts of RAM and CPU spare
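
For reference, this is how we have been changing the threshold (the publication name is a placeholder; we tried values of 0, 1000 and 10,000):

    -- run at the publisher, in the published database
    EXEC sp_changemergepublication
        @publication = N'MyPublication',                 -- placeholder name
        @property    = N'generation_leveling_threshold',
        @value       = N'1000';                          -- also tried 0 and 10000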

Some of the places the subscriber visits do suffer from high latency, but this doesn't seem to have an impact - at 300 ms or 100 ms we still get the same poor performance.

One thing I did wonder about: does replication confirm to the publisher every time it has successfully processed a row at the subscriber? If we have thousands of rows and there is latency on the line, will confirming each item compound the issue? If it does, is there a way to batch up messages between publisher and subscriber?
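
In case it is relevant, the merge agent does expose batching-related parameters. Below is a sketch of how one could be added to a custom agent profile with sp_add_agent_parameter; the profile_id and the value of 500 are placeholders, and we have not yet confirmed whether these settings help in our case:

    -- run at the distributor; look up the profile_id of your custom
    -- merge agent profile in msdb.dbo.MSagent_profiles
    -- (related parameters: DownloadReadChangesPerBatch,
    --  DownloadWriteChangesPerBatch, UploadGenerationsPerBatch)
    EXEC sp_add_agent_parameter
        @profile_id      = 4,        -- placeholder id
        @parameter_name  = N'DownloadGenerationsPerBatch',
        @parameter_value = N'500';   -- placeholder value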

Any help that you can offer will be gladly received!

Thanks

Mark

  • Assuming the 5 databases you are synchronizing are split up into 5 different publications, do all subscribers that subscribe to the 5 publications exhibit a download delivery rate of 1.5 rows/sec? – Brandon Williams Mar 10 '15 at 16:46
  • Also, what value did you set generation leveling threshold to? – Brandon Williams Mar 10 '15 at 17:13
  • No - the other databases are much quicker, but they also have a fraction of the data. We've tried 0, 1000 and 10,000 as possible values. These typically affect the time taken to iterate through changes before starting, but don't affect the download rate. – Mark E Mar 10 '15 at 17:21
  • Are you utilizing any row and/or join filters? Also, are you replicating any LOBs like varchar(max), nvarchar(max) and/or varbinary(max)? – Brandon Williams Mar 10 '15 at 17:51
  • We use replication join filters on some tables and LOBs on some tables, but the performance of those tables is no different from tables without LOBs and join filters. – Mark E Mar 11 '15 at 07:35
  • Are the columns used for the join filters indexed? – Brandon Williams Mar 11 '15 at 17:51
  • Hi all, we got to the bottom of this in the end. It was our use of nvarchar(max) columns that was stopping the replication from using batches. What used to take 3 hours now takes 50 seconds, just by changing the data type. Here is the lesson learnt: "nvarchar(max) is a replication killer". Thanks – Mark E Apr 24 '15 at 15:54

1 Answer


We got to the bottom of this in the end.

It was our use of nvarchar(max) columns that was stopping the replication from using batches. What used to take 3 hours now takes 50 seconds, just by changing the data type.

Here is the lesson learnt: "nvarchar(max) is a replication killer"
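
For anyone hitting the same problem, this is roughly what we did. The table and column names below are illustrative, and the replacement length must be large enough for your existing data:

    -- find the nvarchar(max) columns in the published database
    -- (max_length = -1 identifies the (max) variants)
    SELECT t.name AS table_name, c.name AS column_name
    FROM sys.columns AS c
    JOIN sys.tables  AS t  ON t.object_id = c.object_id
    JOIN sys.types   AS ty ON ty.user_type_id = c.user_type_id
    WHERE ty.name = N'nvarchar' AND c.max_length = -1
    ORDER BY t.name, c.name;

    -- change the type at the publisher so merge replication
    -- propagates the schema change to subscribers
    -- (dbo.Orders / Notes are illustrative names)
    ALTER TABLE dbo.Orders ALTER COLUMN Notes nvarchar(4000) NULL;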

Thanks
