I have an rsync backup script that transfers data between two Ubuntu servers located in different countries. The data set is large in terms of number of files and about 17 GB in total size. The script runs on the receiving server, so it is basically a pull. Public-private key authentication is used for login.

The script works fine; the backup has been happening successfully for many months now.

For the past 6 days or so, the backups have not been completing. The rsync process runs for about 45 minutes and then just ends. I have no idea why it stops. From what I can see, it does not even finish building and scanning the file list. I have the cron output directed to a log file, and all I see in the log is: receiving file list ... done. But nothing has been transferred into the backup destination.

If I run the script manually, after about 45 minutes, I just see this: ./sync.sh: line 51: 9078 Killed $RSYNC $OPTIONS $SOURCE $DESTINATION

How and where do I see the reason for the failure? How do I know which server is actually killing the process, the sender or the receiver?
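
One hint is already in the message above: "Killed" is printed by the shell that runs sync.sh, which means the local rsync process received SIGKILL on the machine running the script. A minimal sketch for recording this in the cron log follows. `explain_status` is a hypothetical helper, not part of the original script, and it relies on the shell convention that a child killed by signal N is reported as exit status 128 + N.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn an exit status into a short explanation.
explain_status() {
    local status=$1
    if [ "$status" -eq 0 ]; then
        echo "success"
    elif [ "$status" -gt 128 ] && [ "$status" -le 192 ]; then
        # The shell reports 128 + signal number when a child dies from a signal
        echo "killed locally by signal $((status - 128))"
    else
        # Anything else is an ordinary exit code (for rsync, see its man page)
        echo "exited with code $status"
    fi
}

# Usage inside sync.sh (variables as in the original script):
#   $RSYNC $OPTIONS $SOURCE $DESTINATION
#   explain_status $?
```

A status of 137 (128 + 9) would correspond to the SIGKILL seen in the manual run.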

The pulling machine (where the script runs) is a low-end box: a KVM VM with 256 MB of RAM. So I am wondering whether building the file structure is taking up too much RAM and causing an OOM error. But how do I check if this is the case? Moreover, there has been no significant increase in the number of files that would explain the sudden failures.
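
One way to check for an OOM kill is to search the kernel log on the machine where the process died, since the kernel's OOM killer logs its activity there. The sketch below assumes the usual message wording ("Out of memory", "Killed process"), which can vary between kernel versions. `oom_events` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: filter kernel log text on stdin for OOM-killer activity.
oom_events() {
    grep -iE 'out of memory|oom-killer|killed process'
}

# Usage on the receiver (reading the kernel log may require root):
#   dmesg | oom_events
#   oom_events < /var/log/kern.log
```

If rsync's PID shows up in the matching lines, the kernel on that machine killed it for lack of memory.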

Any tips would be appreciated.

Thanks.

Update 1

As suggested by @APZ, I added a couple more verbose flags (3 in total) and ran the script manually, redirecting the output to a file. Here is the output at the end:

(.... lots of file names....)
received 5795917 names
done
recv_file_list done
get_local_name count=5795917 /storage/  <======== Reached here after about 40 minutes. Was stuck here for about 10 minutes or so.
[Receiver] _exit_cleanup(code=14, file=main.c, line=788): about to call exit(14)

rsync: fork failed in do_recv: Cannot allocate memory (12)
rsync error: error in IPC code (code 14) at main.c(788) [Receiver=3.0.9]

To answer @TimHaegele, as far as I know, the VM host (Prometeus / IperWeb) does not do any limiting of CPU, IO or anything. I could ask them, though. They are extremely highly rated.

My Ubuntu installation on the VM has 512 MB swap configured. Maybe I can increase that to say 2 GB or so? Disk space is not a problem.

When rsync is running, this is the output of free -m:

             total       used       free     shared    buffers     cached
Mem:           239        236          2          0          0          3
-/+ buffers/cache:        232          7
Swap:          511        510          1

Based on this evidence, would it still make a difference to change the SSH Daemon settings, as suggested?

Update 2

The consensus seems to be that low memory is the issue. So, I added a new swap file of 2GB and activated it. So, now I have a total of 2.5 GB of swap.
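
For reference, adding a swap file usually comes down to four commands. The sketch below only prints them so they can be reviewed first. The path /swapfile2 and the helper name `swap_commands` are assumptions for illustration, not taken from the setup described here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the commands that create and enable a swap file.
# Nothing is executed here; review the output, then run it as root.
swap_commands() {
    local path=$1 size_mb=$2
    cat <<EOF
dd if=/dev/zero of=$path bs=1M count=$size_mb
chmod 600 $path
mkswap $path
swapon $path
EOF
}

# Example: a 2 GB swap file at the assumed path /swapfile2
swap_commands /swapfile2 2048
```

After running the printed commands as root, `free -m` should show the larger swap total. The swap file stays active only until the next reboot unless it is also listed in /etc/fstab.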

Then, I ran the script again (manually). This time, it ran for more than 90 minutes. It was transferring the files by this time. But then suddenly, the process quit. In the logs, I see that it terminated with the following error:

Invalid packet at end of run (4330026) [sender]
[generator] _exit_cleanup(code=12, file=io.c, line=1532): about to call exit(12)
rsync error: protocol incompatibility (code 2) at main.c(695) [sender=3.0.7]
rsync: writefd_unbuffered failed to write 23 bytes to socket [generator]: Broken pipe (32)
rsync error: error in rsync protocol data stream (code 12) at io.c(1532) [generator=3.0.9]
[receiver] _exit_cleanup(code=19, file=main.c, line=1316): about to call exit(19)
rsync error: received SIGUSR1 (code 19) at main.c(1316) [receiver=3.0.9]

As you can see, the sender machine has 3.0.7 and the receiver (puller) has 3.0.9. I don't quite get what the error is.

Meanwhile, I saw @APZ's comment and I have modified my script to replace --delete-after with --delete-delay. I am running it again now. Will get back with updates.

Update 3

Adding more swap and using --delete-delay instead of --delete-after seems to have done the trick. The regular cron job seems to be running properly as well.

Also, I have followed this article to make rsync run with sudo on the sending machine. This has also removed the Permission denied (13) warnings during the transfer.

Thanks for the help, everyone.

P.S.: Everybody who participated in this Q&A gave helpful suggestions. Unfortunately, I can only mark one correct answer.

Anjan

3 Answers

As pointers, I would suggest looking into the rsync logs on the server side. Also, try the verbose mode of rsync:

-v, --verbose This option increases the amount of information you are given during the transfer. By default, rsync works silently. A single -v will give you information about what files are being transferred and a brief summary at the end. Two -v options will give you information on what files are being skipped and slightly more information at the end. More than two -v options should only be used if you are debugging rsync.

APZ
  • I have updated the question with more details, based on your suggestions. – Anjan Jan 30 '13 at 08:57
  • It seems to be running out of memory. From the rsync docs: Usually out of memory when running rsync happens when you are transferring a very large number of files. The size of the files doesn't matter, only the total number of files. First try to use the incremental recursion mode: upgrade both sides to rsync 3.0.0 or newer and avoid options that disable incremental recursion (e.g., use --delete-delay instead of --delete-after). If this is not possible, you can break the rsync run into smaller chunks operating on individual subdirectories using --relative and/or exclude rules. – APZ Jan 30 '13 at 16:47
  • Thank you for the info and the pointer to the [FAQs](http://rsync.samba.org/FAQ.html). I have edited my script to replace `--delete-after` with `--delete-delay`. I have edited the question to add updates. – Anjan Jan 30 '13 at 19:14
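
The "smaller chunks" idea from the comment above could be sketched as one rsync run per top-level subdirectory. This is only an illustration under assumptions: the subdirectory names are passed in by hand, the default options are guesses, and `sync_in_chunks` is a hypothetical helper. `RSYNC` and `OPTIONS` reuse the variable names from sync.sh.

```shell
#!/usr/bin/env bash
# Sketch: one rsync run per top-level subdirectory instead of one huge run.
RSYNC=${RSYNC:-rsync}                    # same variable name as in sync.sh
OPTIONS=${OPTIONS:--a --delete-delay}    # assumed options, adjust as needed

sync_in_chunks() {
    local source=$1 destination=$2 sub
    shift 2
    for sub in "$@"; do
        # Trailing slashes: copy the contents of each subdirectory.
        # $OPTIONS is left unquoted on purpose so it splits into separate flags.
        $RSYNC $OPTIONS "$source/$sub/" "$destination/$sub/" || return $?
    done
}

# Usage (host and directory names are placeholders):
#   sync_in_chunks user@sender:/data /storage dirA dirB dirC
```

Each run then only has to hold the file list of one subdirectory in memory. Files that sit directly in the top-level directory, and subdirectories that were removed on the sender, are not handled by this sketch.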

Is the KVM VM where the rsync script runs managed by a hoster that limits resources such as I/O or CPU time?

To answer your question, I suggest:

First, run sync.sh on a host that you control and that has more than 256 MB of RAM, and see if it runs successfully. If it does, the source of your problem is the client.

Second, and a bit obscure but worth a test: run it at a different time.

In addition, to shorten timeouts:

Use a more aggressive disconnect setting in /etc/ssh/sshd_config on the server, like:

ClientAliveInterval 5
ClientAliveCountMax 3
Tim Haegele

Even with rsync --verbose, these were the final lines of the output:

rsync: [sender] write error: Broken pipe (32)
rsync error: error in socket IO (code 10) at io.c(823) [sender=3.2.3]
rsync error: received SIGUSR1 (code 19) at main.c(1612) [generator=3.2.3]

Turns out my system was running out of space on the destination (20 MB free on a 120 GB APFS volume, a quick way to check is df -h).

  • (In a pinch, you can also try rsync --delete-before to free up space.)
  • (System is macOS 12 running rsync 3.2.3, installed from homebrew. rsync job is from internal drive to external USB drive, which was what was out of space.)
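
Since the error messages do not mention disk space, a small guard before the rsync call can catch this case early. The following is a minimal sketch: `free_kib` and `enough_space` are hypothetical helper names, and the 1 GiB threshold in the usage line is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print free space, in KiB, on the filesystem holding a path.
free_kib() {
    # -P: portable one-line-per-filesystem output; -k: sizes in 1024-byte blocks
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical guard: succeed only if at least the required KiB are free.
enough_space() {
    local path=$1 required_kib=$2
    [ "$(free_kib "$path")" -ge "$required_kib" ]
}

# Usage, with an assumed 1 GiB threshold and a placeholder destination:
#   enough_space /path/to/destination 1048576 || { echo "destination nearly full" >&2; exit 1; }
```

The awk column assumes the mount's device name contains no spaces, which holds for typical setups.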

Nothing immediately obvious from the error messages, though. Googling for rsync + SIGUSR1 points to this question, so this may have been the issue in OP's "Update 2".

chronospoon