
We have always had problems with DFS, but recently it has gotten worse with no apparent cause, and it's becoming harmful. We have one master server and DFS connections to four other servers. The four servers don't modify any files, so all replication always propagates from the master to the other four. The replicated directory holds about 900,000 files. In recent weeks, every time we check, the DFS backlogs contain hundreds of thousands of files. For instance, at the moment the master server is replicating about 700,000 files to three of the four servers, while the fourth one is fine. Sometimes only one is off, sometimes two, this time three, and it is never the same set of servers. It is inconceivable that something periodically touches all 900,000 files. The biggest change that happens is a scheduled update of several thousand files every six hours.

Does anybody have the same problem? Is it a known issue?

Update (this is also an answer to some of the questions raised by Jeff Miles): The problem happened again a few hours ago. I set up some probes in the morning and monitored the servers during the day, and at a seemingly random time, three backlogs ballooned to 3 million changes (more than the total number of files) within a minute. There is nothing interesting in the DFS event log, not even a "started initial replication" event. There were only a couple of "DFS connection lost or unresponsive" errors, but they happened about 10 minutes after the fact, most likely because something choked on the huge backlogs. More importantly, the fourth server is fine, which indicates that the 3 million changes are most likely bogus. I also can't imagine anything changing that many files within such a short interval. Regarding the technical setup: it is a combination of Win2003R2 and Win2008R2. Could that be a problem?
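For reference, a probe here can be as simple as polling the backlog count from the master with the built-in dfsrdiag tool; a minimal sketch (the replication group, folder, and server names below are placeholders, not our real ones):

```
:: Report the backlog from the master to one downstream member.
:: Repeat per member (SERVER1..SERVER4) on a schedule to spot the jump.
dfsrdiag backlog /rgname:MyGroup /rfname:MyFolder ^
    /smem:MASTER /rmem:SERVER1
```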

Jan Zich
Adrian Godong
  • what backup process/software/settings are you using on the master? – pablo Mar 12 '11 at 23:06
  • So users don't create/modify files on the other servers? What's the point of using DFS then? – joeqwerty Mar 12 '11 at 23:28
  • @joeqwerty It's like a distribution list. Changes in the master will be automatically propagated to the other servers. – Adrian Godong Mar 13 '11 at 01:38
  • @Adrian: I'm asking what your exact use of DFS is if you're not using it as a distributed file system for your users. What's the point of using DFS if you're only allowing changes to the "master" files? – joeqwerty Mar 13 '11 at 05:07
  • @joeqwerty Do you have any recommendation for a better solution? I need multiple servers mirrored exactly from one server (to keep maintaining the files easy), with changes synced in as close to real time as possible. – Adrian Godong Mar 14 '11 at 00:46
  • @Adrian: I'm trying to understand your use of DFS. If you're not using it to present the files to your users why have you implemented it? What's your reason or purpose for synching the files? Is it for data redundancy or backup purposes? – joeqwerty Mar 14 '11 at 17:17
  • @joeqwerty We present the files to the users. They just don't have write access to them (and they're served via IIS). – Adrian Godong Mar 23 '11 at 16:53

3 Answers


First, verify your topology. Carefully review the replication connections under the "Connections" tab in your replication set properties:

  • The hub should have one outbound connection from itself to each of the remotes
  • Each of the remotes should have only one outbound connection, from itself back to the hub

I have seen full mesh topologies accidentally added that result in problems like you are seeing.
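If you would rather check this from the command line than from the GUI, the DFSR WMI provider exposes the configured connections; a sketch, assuming the standard root\microsoftdfs namespace on a DFSR member (the property names are worth verifying on your OS version):

```
:: List every replication connection this member knows about.
:: On a spoke you should see one inbound connection (from the hub)
:: and one outbound connection (back to the hub) and nothing else.
wmic /namespace:\\root\microsoftdfs path DfsrConnectionConfig ^
    get PartnerName,Inbound,Enabled
```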

Other possible culprits:

  • Antivirus scanning or file indexing on one or more of the servers or one of their clients. (Opening a file updates its access time, which must then be replicated to all peers.)
  • One or more very large files jamming up replication; this should show in your DFS-R debug logs.
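The DFSR debug logs normally live under %windir%\debug (DfsrNNNNN.log, with older ones compressed); assuming that default location, a quick scan for staging messages that would point at very large files looks roughly like this:

```
:: Search the current DFSR debug logs for staging-related entries.
findstr /i /c:"staging" %windir%\debug\Dfsr*.log
```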

Finally, do you need DFS-R at all, or could a regular robocopy job be used to keep the folders in sync?
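If polling-based mirroring is acceptable, a hedged sketch of a robocopy job in monitor mode (paths and the log location are placeholders):

```
:: Mirror the master share to one replica and keep watching it:
:: re-run after at least 1 change is seen, but no more often than
:: every 5 minutes. /COPY:DATS carries NTFS security (ACLs) too.
robocopy \\MASTER\Files \\SERVER1\Files /MIR /COPY:DATS ^
    /R:2 /W:5 /MON:1 /MOT:5 /NP /LOG+:C:\Logs\robocopy-server1.log
```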

Paul Doom
  • Topology: correct. AV: possible. Large files: unlikely. Robocopy: we need the files replicated ASAP, does Robocopy do this? – Adrian Godong Mar 13 '11 at 01:37
  • Notes: running Robocopy in "loop" mode: http://www.windowsitpro.com/article/migration/robocopy-xp010-faq.aspx – Adrian Godong Mar 13 '11 at 01:37
  • read this article http://support.microsoft.com/kb/947726, taking note of what changes cause a replication. Some application is manipulating these items. – tony roth Mar 13 '11 at 02:07
  • We run Acronis backup, but we have always used it and never had problems like this. However, since about a year ago, the number of files in the replicated directory has been growing rapidly, so the problem could have been building gradually and we just started to notice. – Jan Zich Mar 13 '11 at 03:11

If you're seeing hundreds of thousands of files in the backlog on a regular basis, I would guess that something is changing the security ACLs on your files, especially if you aren't seeing much network traffic while the backlog clears.

One way to check what is modifying these files is to turn on auditing. Ned Pyle of the Microsoft Directory Services team recently published a blog post on using Global Object Access Auditing that might help you determine what is changing: http://blogs.technet.com/b/askds/archive/2011/03/10/global-object-access-auditing-is-magic.aspx
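As a rough sketch of what that entails (Server 2008 R2 or later only; double-check the access-mask flags against the blog post before relying on them):

```
:: Enable File System object auditing machine-wide...
auditpol /set /subcategory:"File System" /success:enable
:: ...then add a global resource SACL auditing successful write
:: access by everyone. Per the linked post, further rights (e.g.
:: WRITE_DAC for ACL changes) may need to be added to the mask.
auditpol /resourceSACL /set /type:File /user:Everyone /success /access:FW
```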

I would check your DFSR event log too, and look for any event ID 4102 (started initial replication) or 4104 (initial replication finished). If your files aren't being modified, the only reason I can think of for hundreds of thousands of files in the backlog is initial replication. If your DFSR service is crashing it could corrupt the DFSR database and trigger initial replication.
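On the 2008 R2 members you can query for those two event IDs from the command line (the DFSR log is named "DFS Replication" there; on 2003 R2, use Event Viewer instead):

```
:: Show the 20 most recent initial-replication start/finish events.
wevtutil qe "DFS Replication" /c:20 /rd:true /f:text ^
    /q:"*[System[(EventID=4102 or EventID=4104)]]"
```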

If you can, I'd try to use Read Only DFSR, described here: http://blogs.technet.com/b/askds/archive/2010/03/08/read-only-replication-in-r2.aspx

I imagine based on your Server 2003 tag that you can't do it yet, but it's worth a mention based on your use case.


Jeff Miles
  • Thanks for the tips on RO DFSR. We are still running a mixed environment right now, but this is definitely an option when we (finally) migrate to a 2008 R2-only environment and keep DFS. – Adrian Godong Mar 14 '11 at 16:40
  • I checked the event log: there is no event ID 4102 or 4104, so I don't think those caused the problem. Also, Global Object Access Auditing works only on 2008 R2. Time to upgrade, I guess. – Adrian Godong Mar 14 '11 at 16:59

Since you are seeing unreasonable numbers of files being replicated within a very short period, there must be an application that is changing file attributes or USN journal values without changing file data. For example, backup software changing the Archive bit would trigger this, as would some AV software.

Testing Anti-Virus Application Interoperability with DFS Replication

I would set up a test replication group to troubleshoot against, and use it to test the effect of items such as backup software and AV software on replication. In addition to the other recommendations you have received, I would also log and watch for changes in the USN journal that occur without the file data changing. The link above is a good article on checking for applications that change the USN journal without changing file data and thereby cause excessive replication.
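The built-in fsutil tool can help with the watching part; a minimal sketch (drive letter and file path are placeholders):

```
:: Confirm the change journal is active and note its current USN.
fsutil usn queryjournal D:
:: Then inspect a file that appeared in the backlog. If its USN keeps
:: advancing while size and timestamps look untouched, something is
:: rewriting its attributes or security rather than its data.
fsutil usn readdata D:\Share\example-file.dat
```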

Watch out for File Screens, Quotas, etc as well. I have seen some scenarios where a file screen stopped replication altogether.

Is your antivirus software set to scan the DfsrPrivate folders (Staging, Conflict and Deleted, etc.)?

-Ken

Ken