
Rsync is going into "interruptible sleep" mode after transferring some files from a local folder to an NFS folder. The folder I am trying to back up contains more than 180 GB of data.

This is what rsync outputs before it hangs:

[sender] expand file_list pointer array to 524288 bytes, did move

I am running Ubuntu Server 14.04 LTS with rsync version 3.1.0 (protocol version 31), invoked with these options:

/usr/bin/rsync -rHAXxvvut --numeric-ids --progress {SRC_FOLDER} {NFS_FOLDER}

Thanks for any hints

jithujose

6 Answers


Considering that rsync is open-source software, it's quite easy to get access to the related source code.

After downloading the main .tar.gz and applying the Ubuntu patch (rsync_3.1.0-2ubuntu0.4.diff.gz), you end up with exactly the code underlying the rsync you're using. Something like this:

$ mkdir rsync
$ cd rsync/
$ wget http://archive.ubuntu.com/ubuntu/pool/main/r/rsync/rsync_3.1.0.orig.tar.gz
$ wget http://archive.ubuntu.com/ubuntu/pool/main/r/rsync/rsync_3.1.0-2ubuntu0.4.diff.gz
$ gzip -d rsync_3.1.0-2ubuntu0.4.diff.gz
$ tar zxvf rsync_3.1.0.orig.tar.gz 
$ cd rsync-3.1.0/
$ patch -p1 < ../rsync_3.1.0-2ubuntu0.4.diff

Now a simple grep can quickly tell us the context of your error message:

$ grep -r 'expand file_list pointer array to' 
flist.c:        rprintf(FCLIENT, "[%s] expand file_list pointer array to %s bytes, did%s move\n",

So you're lucky: your error message appears in a single place, in a single file, namely flist.c.

Let's take a look:

[screenshot of flist.c: the context around the flist_expand routine containing the error message]

It's relatively easy to guess that the routine containing the error message (lines 325-328) is named flist_expand, and that it exists to ensure that the whole file list (everything to be rsync'd) can be held in a properly sized in-memory structure. In other words: the more files you need to rsync, the more memory is required to handle the rsync computations, and since the size of that list is not known in advance, it has to grow dynamically, by allocating ever-larger chunks of memory for the list (more or less).

So I would bet that your problem depends NOT on the size of the data you're rsyncing, but on the number of files. I'd try splitting your rsync into multiple sub-rsyncs, each focusing on an internal subfolder.
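For example, a minimal sketch of that splitting, reusing the placeholders and options from your question (untested, and assuming every item directly under {SRC_FOLDER} is a directory):

# Run one rsync per top-level subfolder, so each invocation only has to
# hold a much smaller file list in memory.
for dir in {SRC_FOLDER}/*/ ; do
    /usr/bin/rsync -rHAXxvvut --numeric-ids --progress "$dir" "{NFS_FOLDER}/$(basename "$dir")/"
done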

Actually, it would be nice to investigate these two lines further:

  1. line 328: (new_ptr == flist->files) ? " not" : "");
  2. line 334: out_of_memory("flist_expand");

but that goes well beyond my initial goal :-)

Anyway, I would bet that if you check your logs you will find some "out of memory" message… :-)

HTH!

Damiano Verzulli

Two remarks that may help:

  • I have always been satisfied with the -a option (from the man page: "archive mode; equals -rlptgoD (no -H,-A,-X)")

  • rsync may be waiting for NFS to give it access to a file. NFS seems able to actually leave rsync locked (probably when overwriting a certain file), and it would be interesting to see which file rsync was accessing just before entering its "sleep". There is a command to see which files rsync currently has open: lsof -ad3-999 -c rsync (from Ask Ubuntu); see the sketch below.
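For reference, something like this (options as in the linked Ask Ubuntu answer) shows what the rsync processes have open; re-running it while rsync is "sleeping" should reveal whether it is stuck on one particular file:

# List files currently open by any rsync process, restricted to
# file descriptors 3-999 (i.e. skipping stdin/stdout/stderr).
lsof -ad3-999 -c rsync

# Repeat every few seconds during the hang to see whether the open
# file changes or stays stuck on one path.
watch -n 5 'lsof -ad3-999 -c rsync'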

Déjà vu

Using rsync to transfer content to an NFS folder is likely to be very inefficient. Think about what's involved when rsync wants to get a checksum of the remote file, or to modify a remote file in place. You're far better off having rsync talk to an rsync process running on the file server. If at all possible I'd change that first and then see whether your current problem still exists. I.e. use rsync over ssh or run an rsync daemon process, and leave NFS out of the picture entirely.
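As a rough, hedged sketch (the host name, user and destination paths below are placeholders, and the daemon variant assumes a [backup] module has been configured in rsyncd.conf on the server):

# 1) rsync over ssh: a real rsync runs on the file server, so checksums
#    and in-place updates happen against its local disks
rsync -rHAXxvvut --numeric-ids --progress {SRC_FOLDER}/ backupuser@fileserver:/export/backup/

# 2) rsync daemon: the client speaks the rsync protocol directly to an
#    "rsync --daemon" on the server; neither NFS nor ssh is involved
rsync -rHAXxvvut --numeric-ids --progress {SRC_FOLDER}/ rsync://fileserver/backup/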

To find out what rsync is doing, strace might be useful:

strace -p <PID>

Or have strace start rsync like so:

strace rsync [rsync options] <src> <target>

Attaching to a running process by PID may require root permissions, though that can be changed (by root).
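On Ubuntu-like systems that restriction is typically the Yama ptrace_scope sysctl; a sketch of checking it and (temporarily, as root) relaxing it:

# Show the current ptrace scope: 0 lets a user attach to their own
# processes, 1 restricts attaching to direct children only.
cat /proc/sys/kernel/yama/ptrace_scope

# Temporarily relax it so "strace -p" works on an already-running
# rsync owned by the same user.
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope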

mc0e

I ran across this problem just now, while setting up rsnapshot, which uses rsync.

I can now reliably repeat the problem when rsnapshot invokes rsync on a very large directory (47,616 directory entries), on the second (repeat) rsnapshot of the same directory. The rsync was going from, and to, a local disk formatted with ext4.

The first time I saw it, the usual /usr/bin/rsync binary (stripped of symbols, "rsync version 3.2.3 protocol version 31") that ships with my Xubuntu 22.04.1 LTS (Jammy Jellyfish) release was being invoked, and I did not have it under strace, so gdb backtraces were undecipherable to me. Running strace on the three "hung" rsync processes (parent, sender, and receiver) showed that all three processes, after printing the "expand file_list pointer array" message, were looping on select(2) system calls with 60-second timeouts.
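For anyone wanting to repeat that inspection, this is roughly what I mean (the PID is of course specific to each run):

# Find the three rsync processes (parent/generator, sender, receiver)
pgrep -a rsync

# Watch one of them loop on select() with its 60-second timeout
sudo strace -p <PID> -e trace=select,pselect6 -T

# Grab a backtrace anyway; without debug symbols this is mostly
# useless, which is why I rebuilt rsync from source below
sudo gdb -p <PID> -batch -ex bt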

So I rebuilt that same 3.2.3 version of rsync from source, and repeated the same test (after I figured out how to reproduce this), and got the following three gdb stack backtraces:

============== Backtrace #1 ==============
#0  0x00007f69b109b74d in __GI___select (nfds=nfds@entry=5, readfds=readfds@entry=0x7ffcd48d9c90, writefds=writefds@entry=0x7ffcd48d9d90, exceptfds=exceptfds@entry=0x7ffcd48d9d10, timeout=timeout@entry=0x7ffcd48d9c60)
at ../sysdeps/unix/sysv/linux/select.c:69
#1  0x0000564b6ef0f544 in perform_io (needed=needed@entry=18, flags=flags@entry=2) at io.c:741
#2  0x0000564b6ef115f5 in write_buf (f=f@entry=4, buf=0x7ffcd48da019 "3dec1fdb7cfc0341_0", len=len@entry=18) at io.c:2125
#3  0x0000564b6eee9088 in send_file_entry (first_ndx=<optimized out>, ndx=<optimized out>, symlink_len=0, symlink_name=0x0, file=0x564b70627350, 
    fname=0x7ffcd48d9fd0 ".config/_some_application__that_I_use_/Cache/3dec1fdb7cfc0341_0", f=4) at flist.c:565
#4  send_file_name (f=f@entry=4, flist=flist@entry=0x564b6f6ae160, fname=fname@entry=0x7ffcd48dc150 ".config/_some_application__that_I_use_/Cache/3dec1fdb7cfc0341_0", stp=stp@entry=0x0, flags=flags@entry=65540, 
    filter_level=filter_level@entry=2) at flist.c:1604
#5  0x0000564b6eeea49d in send_directory (f=f@entry=4, flist=flist@entry=0x564b6f6ae160, fbuf=fbuf@entry=0x7ffcd48dc150 ".config/_some_application__that_I_use_/Cache/3dec1fdb7cfc0341_0", len=len@entry=72, 
    flags=flags@entry=65540) at flist.c:1839
#6  0x0000564b6eeea9b6 in send1extra (f=f@entry=4, file=file@entry=0x564b6f454a50, flist=flist@entry=0x564b6f6ae160) at flist.c:1992
#7  0x0000564b6eeeb21f in send_extra_file_list (f=f@entry=4, at_least=at_least@entry=1000) at flist.c:2078
#8  0x0000564b6eef8add in send_files (f_in=f_in@entry=5, f_out=f_out@entry=4) at sender.c:245
#9  0x0000564b6ef02b3c in client_run (f_in=5, f_out=4, pid=pid@entry=2780890, argc=argc@entry=1, argv=argv@entry=0x564b6efe65b0) at main.c:1317
#10 0x0000564b6eee2227 in start_client (argv=<optimized out>, argc=1) at main.c:1580
#11 main (argc=<optimized out>, argv=<optimized out>) at main.c:1812

    
============== Backtrace #2 ==============

#0  0x00007f69b109b74d in __GI___select (nfds=nfds@entry=2, readfds=readfds@entry=0x7ffcd48d7cd0, writefds=writefds@entry=0x7ffcd48d7dd0, exceptfds=exceptfds@entry=0x7ffcd48d7d50, timeout=timeout@entry=0x7ffcd48d7ca0)
    at ../sysdeps/unix/sysv/linux/select.c:69
#1  0x0000564b6ef0f544 in perform_io (needed=76, flags=flags@entry=4) at io.c:741
#2  0x0000564b6ef1070a in send_msg (code=code@entry=MSG_INFO, buf=buf@entry=0x7ffcd48d87e0 "[generator] expand file_list pointer array to 524288 bytes, did move\n", len=len@entry=69, convert=<optimized out>) at io.c:966
#3  0x0000564b6ef05d63 in rwrite (code=<optimized out>, code@entry=FCLIENT, buf=buf@entry=0x7ffcd48d87e0 "[generator] expand file_list pointer array to 524288 bytes, did move\n", len=69, is_utf8=<optimized out>, is_utf8@entry=0) at log.c:339
#4  0x0000564b6ef063e5 in rprintf (code=code@entry=FCLIENT, format=format@entry=0x564b6ef3f368 "[%s] expand file_list pointer array to %s bytes, did%s move\n") at log.c:442
#5  0x0000564b6eee4445 in flist_expand (extra=<optimized out>, flist=0x564b6f41b3e0) at flist.c:309
#6  flist_expand (flist=0x564b6f41b3e0, extra=<optimized out>) at flist.c:287
#7  0x0000564b6eeec9bd in recv_file_list (f=3, dir_ndx=dir_ndx@entry=2960) at flist.c:2584
#8  0x0000564b6ef13243 in wait_for_receiver () at io.c:1699
#9  0x0000564b6ef0f934 in wait_for_receiver () at io.c:1677
#10 perform_io (needed=89, flags=flags@entry=4) at io.c:862
#11 0x0000564b6ef1070a in send_msg (code=code@entry=MSG_INFO, buf=buf@entry=0x7ffcd48da8d0 ".config/_some_application__that_I_use_/Cache/Cache_Data/7543a11227262604_0 is uptodate\n", len=len@entry=82, convert=<optimized out>) at io.c:966
#12 0x0000564b6ef05d63 in rwrite (code=<optimized out>, code@entry=FCLIENT, buf=buf@entry=0x7ffcd48da8d0 ".config/_some_application__that_I_use_/Cache/Cache_Data/7543a11227262604_0 is uptodate\n", len=82, is_utf8=<optimized out>, is_utf8@entry=0) at log.c:339
#13 0x0000564b6ef063e5 in rprintf (code=code@entry=FCLIENT, format=format@entry=0x564b6ef3fba5 "%s is uptodate\n") at log.c:442
#14 0x0000564b6eeee433 in set_file_attrs (fname=fname@entry=0x7ffcd48de1c0 ".config/_some_application__that_I_use_/Cache/Cache_Data/7543a11227262604_0", file=file@entry=0x564b702ccae8, sxp=<optimized out>, sxp@entry=0x7ffcd48dc010, 
    fnamecmp=fnamecmp@entry=0x0, flags=<optimized out>) at rsync.c:661
#15 0x0000564b6eef3b78 in recv_generator (fname=fname@entry=0x7ffcd48de1c0 ".config/_some_application__that_I_use_/Cache/Cache_Data/7543a11227262604_0", file=file@entry=0x564b702ccae8, ndx=44459, itemizing=itemizing@entry=1, code=code@entry=FLOG, 
    f_out=f_out@entry=1) at generator.c:1805
#16 0x0000564b6eef4c95 in generate_files (f_out=f_out@entry=1, local_name=local_name@entry=0x0) at generator.c:2318
#17 0x0000564b6ef01f7c in do_recv (f_in=<optimized out>, f_in@entry=0, f_out=f_out@entry=1, local_name=local_name@entry=0x0) at main.c:1106
#18 0x0000564b6ef026cc in do_server_recv (argv=<optimized out>, argc=<optimized out>, f_out=1, f_in=0) at main.c:1219
#19 start_server (f_in=f_in@entry=0, f_out=f_out@entry=1, argc=<optimized out>, argv=<optimized out>) at main.c:1253
#20 0x0000564b6ef0281b in child_main (argc=<optimized out>, argv=<optimized out>) at main.c:1226
#21 0x0000564b6ef232b9 in local_child (argc=2, argv=argv@entry=0x7ffcd48df490, f_in=f_in@entry=0x7ffcd48df3f0, f_out=f_out@entry=0x7ffcd48df3f4, child_main=child_main@entry=0x564b6ef02800 <child_main>) at pipe.c:166
#22 0x0000564b6eee21d2 in do_cmd (f_out_p=0x7ffcd48df3f4, f_in_p=0x7ffcd48df3f0, remote_argc=<optimized out>, remote_argv=<optimized out>, user=0x0, machine=<optimized out>, cmd=<optimized out>) at main.c:651
#23 start_client (argv=<optimized out>, argc=1) at main.c:1569
#24 main (argc=<optimized out>, argv=<optimized out>) at main.c:1812


============== Backtrace #3 ==============

#0  0x00007f69b109b74d in __GI___select (nfds=nfds@entry=2, readfds=readfds@entry=0x7ffcd48d7cd0, writefds=writefds@entry=0x7ffcd48d7dd0, exceptfds=exceptfds@entry=0x7ffcd48d7d50, timeout=timeout@entry=0x7ffcd48d7ca0)
    at ../sysdeps/unix/sysv/linux/select.c:69
#1  0x0000564b6ef0f544 in perform_io (needed=76, flags=flags@entry=4) at io.c:741
#2  0x0000564b6ef1070a in send_msg (code=code@entry=MSG_INFO, buf=buf@entry=0x7ffcd48d87e0 "[generator] expand file_list pointer array to 524288 bytes, did move\n", len=len@entry=69, convert=<optimized out>) at io.c:966
#3  0x0000564b6ef05d63 in rwrite (code=<optimized out>, code@entry=FCLIENT, buf=buf@entry=0x7ffcd48d87e0 "[generator] expand file_list pointer array to 524288 bytes, did move\n", len=69, is_utf8=<optimized out>, is_utf8@entry=0) at log.c:339
#4  0x0000564b6ef063e5 in rprintf (code=code@entry=FCLIENT, format=format@entry=0x564b6ef3f368 "[%s] expand file_list pointer array to %s bytes, did%s move\n") at log.c:442
#5  0x0000564b6eee4445 in flist_expand (extra=<optimized out>, flist=0x564b6f41b3e0) at flist.c:309
#6  flist_expand (flist=0x564b6f41b3e0, extra=<optimized out>) at flist.c:287
#7  0x0000564b6eeec9bd in recv_file_list (f=3, dir_ndx=dir_ndx@entry=2960) at flist.c:2584
#8  0x0000564b6ef13243 in wait_for_receiver () at io.c:1699
#9  0x0000564b6ef0f934 in wait_for_receiver () at io.c:1677
#10 perform_io (needed=89, flags=flags@entry=4) at io.c:862
#11 0x0000564b6ef1070a in send_msg (code=code@entry=MSG_INFO, buf=buf@entry=0x7ffcd48da8d0 ".config/_some_application__that_I_use_/Cache/Cache_Data/7543a11227262604_0 is uptodate\n", len=len@entry=82, convert=<optimized out>) at io.c:966
#12 0x0000564b6ef05d63 in rwrite (code=<optimized out>, code@entry=FCLIENT, buf=buf@entry=0x7ffcd48da8d0 ".config/_some_application__that_I_use_/Cache/Cache_Data/7543a11227262604_0 is uptodate\n", len=82, is_utf8=<optimized out>, is_utf8@entry=0) at log.c:339
#13 0x0000564b6ef063e5 in rprintf (code=code@entry=FCLIENT, format=format@entry=0x564b6ef3fba5 "%s is uptodate\n") at log.c:442
#14 0x0000564b6eeee433 in set_file_attrs (fname=fname@entry=0x7ffcd48de1c0 ".config/_some_application__that_I_use_/Cache/Cache_Data/7543a11227262604_0", file=file@entry=0x564b702ccae8, sxp=<optimized out>, sxp@entry=0x7ffcd48dc010, 
    fnamecmp=fnamecmp@entry=0x0, flags=<optimized out>) at rsync.c:661
#15 0x0000564b6eef3b78 in recv_generator (fname=fname@entry=0x7ffcd48de1c0 ".config/_some_application__that_I_use_/Cache/Cache_Data/7543a11227262604_0", file=file@entry=0x564b702ccae8, ndx=44459, itemizing=itemizing@entry=1, code=code@entry=FLOG, 
    f_out=f_out@entry=1) at generator.c:1805
#16 0x0000564b6eef4c95 in generate_files (f_out=f_out@entry=1, local_name=local_name@entry=0x0) at generator.c:2318
#17 0x0000564b6ef01f7c in do_recv (f_in=<optimized out>, f_in@entry=0, f_out=f_out@entry=1, local_name=local_name@entry=0x0) at main.c:1106
#18 0x0000564b6ef026cc in do_server_recv (argv=<optimized out>, argc=<optimized out>, f_out=1, f_in=0) at main.c:1219
#19 start_server (f_in=f_in@entry=0, f_out=f_out@entry=1, argc=<optimized out>, argv=<optimized out>) at main.c:1253
#20 0x0000564b6ef0281b in child_main (argc=<optimized out>, argv=<optimized out>) at main.c:1226
#21 0x0000564b6ef232b9 in local_child (argc=2, argv=argv@entry=0x7ffcd48df490, f_in=f_in@entry=0x7ffcd48df3f0, f_out=f_out@entry=0x7ffcd48df3f4, child_main=child_main@entry=0x564b6ef02800 <child_main>) at pipe.c:166
#22 0x0000564b6eee21d2 in do_cmd (f_out_p=0x7ffcd48df3f4, f_in_p=0x7ffcd48df3f0, remote_argc=<optimized out>, remote_argv=<optimized out>, user=0x0, machine=<optimized out>, cmd=<optimized out>) at main.c:651
#23 start_client (argv=<optimized out>, argc=1) at main.c:1569
#24 main (argc=<optimized out>, argv=<optimized out>) at main.c:1812

===

Notice that the expected error message is being printed in Stack Backtrace #2, and that recv_file_list() and flist_expand() are on that stack.

These rsync instances were being invoked with, more or less, the following command line options, the way I had rsnapshot configured to invoke it:

rsync -av --delete-during --archive --verbose --one-file-system --hard-links --xattrs --sparse src_dir dest_dir

I will see if I can make further progress debugging this, and make another post here if I do.


I have a large directory with 41621 directory entries.

When doing an rsync copy of that directory to another local disk, as described above in my previous long comment a few days ago, the rsync process known within rsync as the "sender" reads that directory, about 819 entries at a time, and sends what it sees to another rsync process known within rsync as the "receiver".

It should take the sender about 50 getdents64() system calls to read 41621 directory entries, 819 entries at a time.

After examining each of the 819 directory entries returned to it in typically 2 or 3 getdents64() system calls, the sender then sends a nearly 64 kbyte compressed body of information about those directory entries to the receiver.

The sender sends this data by writing almost 64 kbytes (65536, i.e. 2^16, bytes) at a time over a local socket to the receiver, and the receiver reads almost 32768 bytes at a time.

Fairly often the receiver will handle the big write from the sender with 3 reads: the first two read almost, but not quite, everything that was just sent, and a third read picks up the few final bytes.
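The pattern described above can be seen directly by tracing just those system calls; a rough sketch of the kind of strace invocation I mean (src_dir and dest_dir as in the rsync command further below):

# Follow all three rsync processes and log only the calls relevant to
# the directory-reading / socket-shuffling pattern described above.
strace -f -tt -T -e trace=getdents64,read,write,select,pselect6 \
       -o rsync.strace \
       rsync -av --delete-during --one-file-system --hard-links --xattrs --sparse src_dir dest_dir

# Rough count of getdents64() calls needed for the big directory
grep -c getdents64 rsync.strace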

The above processing of my big directory does not quite complete, however.

After the sender has read the results of its 45th and 46th getdents64() calls and sent them over to the receiver, the receiver fails to read most of that data; instead it reads only a few bytes of the nearly 64 kbytes just sent to it. Then the receiver, as usual, invokes pselect6() to see whether there is more for it to read from the sender, and the kernel does not respond to that pselect6() until its 60-second timer expires ... indicating that there is nothing more for the receiver to read on that socket ... this despite the sender having just sent it another nearly-64-kbyte packet.

That leaves a bunch of unread data on that socket. Then, when the sender is ready to send what it learned from its 47th getdents64() call on that large directory, it first, as usual, checks with pselect6() whether its socket to the receiver has room for more writes, and that pselect6() also blocks until the (same-duration) 60-second timeout, indicating to the sender that the socket is too full of existing data and can't accept more until the task on the other end reads some of it.

We then end up with both the sender and the receiver tasks (as well as the third rsync task, known as the "generator", which doesn't appear to be involved in this hang) all looping on 60-second pselect6() timeouts, all being told there is nothing to read, and no room to write any more, on the various sockets they share.
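One way to see that stalemate from the outside, without strace, is to look at the queue counters on the sockets the rsync tasks share (on Linux builds these are socketpairs, so they show up as unix-domain sockets); a sketch:

# During the hang I would expect the Recv-Q / Send-Q columns for the
# rsync-owned sockets to stay non-zero and never drain.
ss -x -p | grep rsync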

The rsync message that (if the verbose option is set) appears just before this interruptible hang:

"[sender] expand file_list pointer array"

appears now to me to be incidental to this hang. That same message appears several times in the long rsync run that I am using to recreate this problem. It indicates that rsync has chosen to expand the size of its dynamically allocated "file_list pointer array", perhaps because it needed a bigger buffer to handle this huge directory. I see no indication that this message is a sign of the problem causing the hang, which occurs a few getdents64() calls later in the session.

The above analysis comes from spending a few hours, over a couple of days, examining the 200 MB of strace output from running the rsnapshot job that invoked this rsync task. I've been an expert in strace, and the underlying ptrace system call on Unix/Linux, for over 40 years.

===

P.S. -- I've further studied the strace output, as well as the corresponding rsync source code (where 'len' is set, just before the read() system call, in the perform_io() routine in io.c), and as far as I can tell now, the FIRST sign of anything amiss is the failure of a pselect6() call by the recipient to report that there is data available to read on the recipient's fd == 0, data which had been placed into the socket feeding that fd, immediately before, by the sender.

So my current working hypothesis is that this is a kernel bug.

The fact that the kernel returns (at least as seen via ptrace) errno ERESTARTNOHAND to all three rsync processes (sender, recipient and generator) when I send them a SIGINT after they've hung is another sign of a kernel "weirdness" that has long been noticed but never resolved. Perhaps this new suspicious failure of pselect6() to report available data is related to that long-standing ERESTARTNOHAND weirdness. Perhaps not.

Years ago, I would have debugged this suspect kernel behaviour myself; but I no longer have that skill set, so I will have to see if I can produce a quality C program that demonstrates the problem (if my current working hypothesis holds up).

===

P.P.S. -- Aha - the 3rd rsync process, the one called the 'generator' ALSO stalled out on a pselect6() call, just like the 'receiver' process did.

And BOTH of them stalled out, hanging on a pselect6() they should not have hung on, just after doing the SECOND call to expand their buffer, and hence just after issuing their SECOND associated 'expand file_list pointer array' verbose message.

Note also that the original poster in this thread reported that he was also backing up a huge directory ... he gave only the total space taken by the files in that directory, but I would not be surprised if, as someone else noted above, the number of directory entries in that directory matters more.

P.P.P.S -- The following little shell script reproduces the problem. Doubling the verbose flag (-vv, or more, I presume) on rsync in this case results in a hang, whereas a single verbose flag (-v) succeeds fine.

#!/bin/sh

# build two big test directories
mkdir -p bigdirs/1 bigdirs/2 newdirs
cd ./bigdirs/1
seq 55555 | xargs touch
cd ../2
seq 55555 | xargs touch
cd ../..

# successfully rsync the test directories, with a single-verbose rsync
echo invoking rsync with single verbose - should succeed quickly.
rsync -a -v bigdirs/ newdirs 1>single_verbose_fd1.out 2>single_verbose_fd2.out
ls -lt single_verbose_fd1.out single_verbose_fd2.out

# hang trying same rsync with double verbose
echo
echo invoking double verbose rsync and expecting it to hang
echo interrupt with control-C or such to end the hang
rsync -a -vv bigdirs/ newdirs 1>double_verbose_fd1.out 2>double_verbose_fd2.out

Well ... I went to file a bug against rsync for the above hang, and almost the first thing they said at https://rsync.samba.org/issues.html was "don't use more than one verbose flag". It fills up the communication lines between the various rsync tasks. I agree, it does. That's why I wrote the above little test case showing a hang with two or more verbose flags ... I was seeing the communication sockets between the rsync tasks filling up with all the verbose messages and then blocking, causing the hang.

So I may still file a bug at https://rsync.samba.org/issues.html, but I might not get much sympathy from them.

--

U.V. [Ultimum Verbum, the last word]

My above test case demonstrates a hang using the rsync version 3.2.3 that is in the Ubuntu version 22.04.1 LTS (Jammy Jellyfish) that I am running.

However, with the latest rsync release, 3.2.7, downloaded and rebuilt from https://rsync.samba.org, this test case no longer fails. The hang (at least the one the above shell script demonstrates) has been fixed.
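For completeness, roughly how I built and tested it (the tarball URL follows the usual rsync download layout, and the --disable flags are only there to skip optional compression/checksum libraries I don't have installed, so treat this as a sketch rather than exact build instructions):

wget https://download.samba.org/pub/rsync/src/rsync-3.2.7.tar.gz
tar zxf rsync-3.2.7.tar.gz
cd rsync-3.2.7
./configure --disable-xxhash --disable-zstd --disable-lz4
make
./rsync --version    # should report 3.2.7
# then re-run the reproduction script above with this ./rsync first in PATH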

Thank you, rsync :).

I will file no bug reports.


I had the same problem and managed to work around it by doing the transfer "remotely". For example, instead of doing rsync -avvxAX source/ destination/, I did rsync -avvxAX source/ root@localhost:destination/

rsync probably has specific optimizations for local transfers. Until the exact cause of the problem is figured out, this workaround fixes it for me.

My system: Gentoo Linux, kernel 4.3.0, systemd, rsync 3.1.2 protocol version 31

Priyadi