To answer the original question as stated, there are two things to discuss here.
Using SSHFS
SSHFS uses the SFTP "subsystem" of the SSH protocol
to make a remote filesystem appear as if it were mounted locally.
The crucial thing to note here is that SSHFS translates low-level
syscalls into relatively high-level SFTP commands, which are then translated
back into syscalls executed on the server by the SFTP server; their results are
then sent back to the client and translated back into syscall return values.
There are several sources of slowness with this process:
- There are distinct syscalls for distinct operations on files,
and they are executed in the order the client issues them.
Say the client stat(2)-s a file, then open(2)-s it, then reads its data
by executing several read(2) calls in a row, and finally close(2)-s the file:
all those syscalls have to be translated into SFTP commands, sent to the
server, processed there, and their results sent back to the client and
translated back.
- Even though SSHFS implements certain clever hacks such as
"read ahead" (speculatively reading more data than the client requested),
each syscall still results in a round trip to the server and back.
That is, we send a request to the server, wait for it to respond, then
process its response. IIUC, SFTP does not implement "pipelining"
(a mode of operation where we send further commands before the earlier ones
have completed), so basically each syscall has to wait for the previous one's
round trip to finish. While it's technically possible
to have such processing to a certain degree, sshfs
does not appear to implement it.
IOW, each syscall cp makes on your client machine is translated
into a request to the server, followed by waiting for the server to respond
and then processing its response.
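If you want to see that sequence for yourself, strace makes it visible
(a quick sketch; the paths below are made up, and you need strace installed
on the client):

```sh
# Trace the file-related syscalls cp issues for a single file copy;
# every one of them that touches the sshfs mount point becomes at
# least one SFTP round trip to the server.
strace -f -e trace=openat,read,write,close,newfstatat \
    cp -n /local/src/file /mnt/sshfs/dst/file

# Or just count the syscalls to get a feel for the volume of round trips:
strace -c cp -n /local/src/file /mnt/sshfs/dst/file
```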
Multiple cp -n processes run in parallel
Whether it's OK to employ multiple cp -n
processes copying files in parallel
depends on several considerations.
First, if they all run over the same SSHFS mount, there will obviously be
no speedup, as all the syscalls issued by the multiple cp processes will
eventually hit the same SFTP client connection and be serialized by it,
for the reasons explained above.
Second, running several instances of cp -n
over distinct
SSHFS mount points may be worthwhile, up to the limits imposed by the
network throughput and the I/O throughput of the medium/media under
the target filesystem.
In this case, it's crucial to understand that since SSHFS won't use any
locking on the server, the different instances of cp -n
must operate
on distinct directory hierarchies, simply to not step on each other's toes.
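As a concrete illustration (the host name, directories and mount points
below are placeholders; adjust them to your setup), that looks roughly
like this:

```sh
# Two independent SSHFS mounts, hence two independent SFTP connections.
mkdir -p /mnt/remote-a /mnt/remote-b
sshfs user@server:/data /mnt/remote-a
sshfs user@server:/data /mnt/remote-b

# Each cp -n works on its own directory hierarchy, so the two
# instances never touch the same files on the server.
cp -Rn /src/photos /mnt/remote-a/ &
cp -Rn /src/videos /mnt/remote-b/ &
wait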
Different / more sensible approaches
First, piping a data stream created by tar, cpio or another streaming
archiver and processing it remotely has the advantage that all the round trips
for the individual filesystem operations are avoided: the local archiver creates
the stream as fast as the I/O throughput of the source filesystem allows
and sends it as fast as the network allows; the remote archiver extracts
data from the stream and updates its local filesystem as fast as that allows.
No round trips to execute elementary "commands" are involved: you just go
as fast as the slowest I/O point in this pipeline allows you to;
it's simply impossible to go faster.
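For instance, a tar-over-ssh pipeline could look something like this
(host name and paths are placeholders again, the destination directory is
assumed to exist on the server, and the second variant assumes GNU tar there):

```sh
# Pack the source tree locally, stream it over a single SSH channel,
# and unpack it on the server; no per-file SFTP round trips involved.
tar -cf - -C /src . | ssh user@server 'tar -xf - -C /data/dst'

# If you want to skip files that already exist on the server,
# GNU tar can do that on extraction:
tar -cf - -C /src . | ssh user@server 'tar -xf - -C /data/dst --skip-old-files'
```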
Second, another answer suggested using rsync, and you rejected that
suggestion on the grounds that
"rsync is slow as it has to checksum the files".
This is simply wrong.
To cite the rsync manual page:
-c, --checksum
This changes the way rsync checks if the files have
been changed and are in need of a transfer. Without this option, rsync
uses a "quick check" that (by default) checks if each file's size and
time of last modification match between the sender and receiver. This
option changes this to compare a 128-bit checksum for each file that
has a matching size.
and
-I, --ignore-times
Normally rsync will skip any files that are
already the same size and have the same modification timestamp. This
option turns off this "quick check" behavior, causing all files to be
updated.
--size-only
This modifies rsync's "quick check" algorithm for
finding files that need to be transferred, changing it from the
default of transferring files with either a changed size or a changed
last-modified time to just looking for files that have changed in
size. This is useful when starting to use rsync after using another
mirroring system which may not preserve timestamps exactly.
and finally
--existing
skip creating new files on receiver
--ignore-existing
skip updating files that exist on receiver
That is,
- By default rsync does not hash the file's contents to see whether a file
has changed.
- You can tell it to behave exactly like cp -n, that is, to skip updating
a file if it merely exists on the remote; see the sketch below.
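Something along these lines should be roughly equivalent to your parallel
cp -n runs, in a single command (host name and paths are placeholders):

```sh
# -a                 recurse and preserve permissions, times, symlinks, etc.
# --ignore-existing  never touch a file that already exists on the receiver,
#                    which is the cp -n behaviour
# Note the trailing slash on the source: copy the *contents* of /src.
rsync -a --ignore-existing /src/ user@server:/data/dst/
```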