
I mount a remote server over ssh (using sshfs). I want to copy a large number of files from the remote server to local:

cp -rnv /mounted_path/source/* /local_path/destination

The command performs a recursive copy that doesn't overwrite existing files. But the copying process is rather slow. I notice that it does not copy files in order. So my question is: can I speed up the copying process by opening multiple terminals and running the same command above? Is the copying process smart enough not to overwrite files copied by the other processes?

Tu Bui

4 Answers


…to answer the original question as stated…

There are two things to discuss here.

Using SSHFS

SSHFS uses the SFTP "subsystem" of the SSH protocol to make a remote filesystem appear as if it were mounted locally.

A crucial thing to note here is that SSHFS translates low-level syscalls into relatively high-level SFTP commands; these are translated back into syscalls executed by the SFTP server on the remote machine, and their results are then sent back to the client and translated in reverse.

There are several sources of slowness with this process:

  • There are distinct syscalls for distinct operations on files, and they are executed in the order the client issues them. Say, the client stat(2)-s a file to get its metadata, then open(2)-s it, then reads its data by executing several read(2) calls in a row, and finally close(2)-s the file: all of those syscalls have to be translated to SFTP commands, sent to the server, processed there, and their results sent back to the client and translated back.
  • Even though SSHFS implements certain clever hacks such as "read ahead" (speculatively reading more data than the client requested), each syscall still results in a round trip to the server and back. That is, we send a request to the server, wait for it to respond, then process its response. IIUC, SFTP does not implement "pipelining" (sending further commands before the earlier ones have completed), so basically each syscall costs a full round trip. While such processing is technically possible to a certain degree, sshfs does not appear to implement it.

    IOW, each syscall cp makes on your client machine is translated into a request to the server, followed by waiting for its response and receiving it.

Multiple cp -n processes run in parallel

The answer to the question of whether it's OK to employ multiple cp -n processes copying files in parallel depends on several considerations.

First, if they all run over the same SSHFS mount, there will obviously be no speedup: all the syscalls issued by the multiple cp instances eventually hit the same SFTP client connection and are serialized by it, for the reasons explained above.

Second, running several instances of cp -n over distinct SSHFS mount points may be worthwhile, up to the limits imposed by the network throughput and the I/O throughput of the medium/media under the target filesystem. In this case, it's crucial to understand that since SSHFS won't use any locking on the server, the different instances of cp -n must operate on distinct directory hierarchies, simply to not step on each other's toes.
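A minimal local sketch of that partitioning (all paths and directory names here are made up for illustration; with SSHFS, the two source subtrees would live under two separate mounts of the same server):

```shell
# Local demonstration of partitioned parallel copies: each cp -n
# instance works on its own subtree, so they never race on the same
# destination files.
mkdir -p /tmp/par_demo/src/dir_a /tmp/par_demo/src/dir_b /tmp/par_demo/dst/dir_a
echo new > /tmp/par_demo/src/dir_a/f1
echo new > /tmp/par_demo/src/dir_b/f2
echo old > /tmp/par_demo/dst/dir_a/f1          # already "copied": -n must skip it

cp -rn /tmp/par_demo/src/dir_a /tmp/par_demo/dst/ &   # worker 1: its own subtree
cp -rn /tmp/par_demo/src/dir_b /tmp/par_demo/dst/ &   # worker 2: a distinct subtree
wait   # block until both background copies finish
```

Afterwards, dst/dir_a/f1 still holds its old contents (skipped by -n) while dst/dir_b/f2 has been copied.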

Different / more sensible approaches

First, piping a data stream created by tar, cpio or another streaming archiver and processing it remotely has the advantage that all round trips for the individual filesystem operations are avoided: the local archiver creates the stream as fast as the I/O throughput of the source filesystem allows and sends it as fast as the network allows; the remote archiver extracts data from the stream and updates its local filesystem as fast as that allows. No round trips to execute elementary "commands" are involved: you simply go as fast as the slowest I/O point in this pipeline allows; it's impossible to go faster.
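A local sketch of the streaming principle (paths made up; over the network you would run one side of the pipe through ssh, e.g. the producing side as ssh user@server 'tar -C /remote/source -cf - .'):

```shell
# One archiver serializes the whole tree into a single byte stream;
# the other deserializes it. The pipe never waits for per-file
# acknowledgements, so there are no per-operation round trips.
mkdir -p /tmp/stream_demo/src/sub /tmp/stream_demo/dst
echo hello > /tmp/stream_demo/src/sub/file.txt
tar -C /tmp/stream_demo/src -cf - . | tar -C /tmp/stream_demo/dst -xf -
```

The destination tree ends up identical to the source tree, built from one continuous stream.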

Second, another answer suggested using rsync and you rejected that suggestion on the grounds of

rsync is slow as it has to checksum the files.

This is simply wrong. To cite the rsync manual page:

-c, --checksum

This changes the way rsync checks if the files have been changed and are in need of a transfer. Without this option, rsync uses a "quick check" that (by default) checks if each file's size and time of last modification match between the sender and receiver. This option changes this to compare a 128-bit checksum for each file that has a matching size.

and

-I, --ignore-times

Normally rsync will skip any files that are already the same size and have the same modification timestamp. This option turns off this "quick check" behavior, causing all files to be updated.

--size-only

This modifies rsync's "quick check" algorithm for finding files that need to be transferred, changing it from the default of transferring files with either a changed size or a changed last-modified time to just looking for files that have changed in size. This is useful when starting to use rsync after using another mirroring system which may not preserve timestamps exactly.

and finally

--existing skip creating new files on receiver

--ignore-existing skip updating files that exist on receiver

That is,

  • By default rsync does not hash the file's contents to see whether a file has changed.
  • You can tell it to behave exactly like cp -n, that is, to skip updating files that already exist on the receiver.
kostix
  • +1 very detailed answer. Now I understand more about both sshfs and rsync and their options. – Tu Bui Oct 24 '17 at 12:47

I'd recommend using two instances of tar or cpio piped over an SSH channel, like in

$ tar -C src/path -cf - . | ssh user@server tar -C dst/path -xf -

This approach has the advantage of keeping the pipe full with a single stream of data (you can also stick | pv in between to see how it goes if you'd like some interactivity), compared to SSHFS (and SFTP), which performs many round trips between the server and the client.

The crucial bit here is that SSH is not merely about "logging in remotely", as many people assume; it's rather about running any command remotely while connecting its standard I/O streams to the local SSH client instance.


Note that if this happens on a secured LAN or other controlled environment, it may be best to ditch SSH and use a pair of nc or socat instances: a listening one on the server and a sending one on the client. This approach does not spend CPU cycles on encrypting the data, so you'll likely be bounded by I/O at one of the three components: the source FS, the network, or the destination FS.

kostix
  • thanks for the tip (+1). I will use it next time. However it does not directly answer my question which is about whether `cp -n` check file existence on the fly in general, and if using multiple such commands over ssh can speed up the process? – Tu Bui Oct 14 '17 at 21:45
  • @TuBui, see my another answer. – kostix Oct 15 '17 at 15:08

No, the copying process is not smart enough to avoid overwriting files copied by other processes. Executing multiple commands to copy the same files/folders is not a good idea.

Sometimes, you can't do much when the source and target machines are far apart and the network is slow. Here is a post discussing why SSHFS is slow.

Khaled
  • does it make any difference if my source has ~500 folders and each folder has several thousand files? I just run that command twice and notice the first process is copying folder #10 and the second process is dealing with folder #100. I am not sure if later the first process attempts to copy folder #10 as well. It clearly depends on how linux implements the `-n` option in the command. Does it check file existence at the destination on the fly (good) or pre-check it before copying any file (bad)? – Tu Bui Oct 12 '17 at 15:25
  • This is a core use case for rsync, have you looked it it instead? – TheFiddlerWins Oct 12 '17 at 15:37
  • I did. But rsync is slow as it has to checksum the files. Here my destination is originally empty so using rsync may be an overkill. – Tu Bui Oct 12 '17 at 16:30

I suggest using rsync with the -avP flags. Example:

rsync -avP <Source>  <Destination>
KK Patel