
I have a large directory that needs to be synced from a local server to my web server, and I'm looking for the most efficient way to do it. The directory contains 113k files in 14k directories and is roughly 5 GB in size. A local-to-remote comparison of every file and directory takes several hours to complete, even when very little has changed.

Local machine is Win7, remote is CentOS 5.5

My current setup uses a scripted synchronize in WinSCP, but as mentioned, crawling the directories over a single SCP connection takes hours. The number of files that actually need updating should be much smaller than the overall set, so I'd like to find a way to script the sync locally, log which files changed, and then only hit the web server to upload the new files.

Any suggestions?

Infraded

3 Answers


Have a look at DeltaCopy or Syncrify, which are both based on the rsync protocol. They will only transfer files that are new or have changed and, more importantly, they will only transfer the changed blocks of large files. rsync will probably already be installed on your CentOS machine.
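If you end up driving rsync itself (e.g. via cwRsync, or the rsync.exe that DeltaCopy bundles), the call is roughly the following. This is only a sketch; the hostname, username, and paths are placeholders:

rem Sketch: push only new/changed data to the web server over SSH.
rem cwRsync exposes drive letters as /cygdrive/<letter>/... paths.
rem -a preserves the tree structure, -z compresses, --delete removes remote
rem files that no longer exist locally (omit it if you don't want that).
rsync -avz --delete /cygdrive/c/source/ user@webserver.example.com:/var/www/site/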

user9517
  • While I agree an rsync-based tool is probably the best way to go and will help, no matter what tool you use, syncing a large number of files will tend to be slow compared to the speed you would get with a few large files using the same amount of space. – Zoredache Mar 15 '11 at 19:23
  • @Zoredache: One thing that rsync will provide (especially if you can use a client/server setup) is that each system uses its local disk to find changes and communicates change information far more efficiently than dumping the entire tree (and metadata) over SCP, which will be terribly slow. – afrazier Mar 15 '11 at 19:43

If changes are occurring locally only (i.e. a one-way sync), you might think about just using an archiver (zip, tar, etc.) to bundle up the modified files for transport to the remote server. Presumably you can use the modification date, the archive bit, or, worst case, a second local copy that you maintain as the basis for determining which files have changed.
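As a minimal sketch of the "second local copy" idea (the paths and log file name below are hypothetical), robocopy can copy only the files that are newer than the copies in a mirror directory you maintain, and log exactly what it copied:

rem Sketch: compare the live tree against a locally maintained mirror.
rem /E   = include subdirectories (even empty ones)
rem /XO  = skip files that are not newer than the mirror copy
rem /LOG = record which files were copied, i.e. which files changed
robocopy "C:\source" "C:\mirror" /E /XO /LOG:C:\changed-files.log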

Rsync and other delta-copy programs are nice, but I suspect that your problem may be simple enough to solve w/o going to that extreme. With a large number of small files you'll also experience a lot of delays using rsync because of latency.

Since your source is a Windows machine you could use the "Archive" bit as a telltale for which files have been modified (assuming the update process is toggling the archive bit). You could do something simple like:

@echo off
set SRC=C:\source
set STAGING=C:\staging

rem Copy all files from source to staging, including subdirectories,
rem where "Archive" bit is set.
xcopy "%SRC%\*" "%STAGING%\" /e /s /a

rem Clear the archive bit on all files (and directories) under source
attrib -A "%SRC%\*" /S /D

That would leave the "staging" directory containing only the files that have changed (albeit with an empty subdirectory for every directory where nothing changed). It would also clear the archive bit on all the files in all the subfolders of the source. You could then ZIP up the staging directory (using your favorite command-line ZIP program) and ship it out to the remote server for decompression.
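A rough sketch of that last step, assuming 7-Zip as the command-line ZIP program and reusing the WinSCP scripting you already have (the 7-Zip path, account, hostname, and remote directory are placeholders; %STAGING% is the variable from the batch file above):

rem Sketch: zip the staging tree and upload the archive with WinSCP.
"C:\Program Files\7-Zip\7z.exe" a -tzip C:\changes.zip "%STAGING%\*"
winscp.com /command ^
    "open sftp://user@webserver.example.com/" ^
    "put C:\changes.zip /var/www/incoming/" ^
    "exit"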

This doesn't give you any delta compression, but at an average size of roughly 45 KB per file, delta compression probably won't help you much anyway, and the latency "win" of this simplistic method may work out better for you.

Evan Anderson

Unison is another possibility. The important part is getting something you can run on the server via SSH, so that a server-side process handles the disk I/O at that end rather than you walking the entire filesystem remotely. Unison can run over ssh and uses the rsync algorithm to transfer only the changed parts of files.
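A minimal sketch of a one-way run from the Windows side (the local path, account, server name, and remote root are placeholders; it assumes matching Unison versions on both machines and SSH access to the server):

rem Sketch: sync the local tree to the web server over SSH.
rem -batch = run without prompting; -force <root> makes the local replica
rem win, so changes only flow from the local machine to the server.
unison "C:/source" "ssh://user@webserver.example.com//var/www/site" -batch -force "C:/source"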

afrazier