a perl script faster than rsync for images and partitions, that produces two-way diffs

Question

sysadmin1138 and Martin have reported a replacement for rsync that works on block devices (partitions). It is written in Perl, and I would like it to also store two-way diffs.

It applies the changes in a block device to a preexisting, outdated backup image. This is the second-best way to do that, after lvmsync, which I did not use because my block device is not in LVM.

But I also wanted to collect the changes separately, in order to be able to regenerate the previous backup image (e.g., to recover a deleted file).

The following code collects these changes while the rsync replacement runs:

patch=diff.`date +'%Y%m%d.%H%M%S.%N'`.gz
ssh $username@$backupnas "perl -'MDigest::MD5 md5' -ne   "\
"        'BEGIN{\$/=\1024};print md5(\$_)' $remotepartition        "\
" | gzip -c                                              "\
|gunzip -c|LANG= tee >(wc -c|LANG= sed '1s%^%number of 64 bytes blocs: %' >&2) \
|LANG= perl -'MDigest::MD5 md5' -e 'open DISK,"'"<$partition"'" or die $!; '\
'         while( read DISK,$read,1024)                                     '\
'         {                                                                '\
'           read STDIN,$md,16;                                             '\
'           if($md eq md5($read)) {print "s"} else {print "c" . $read }    '\
'         }                                                                '\
| gzip -c                                                                       \
|ssh $username@$backupnas "touch $remotepartition;LANG= tee -a $patch|gunzip -c"\
"     |perl -e 'open REVP,\"| gzip -c > rev.$patch\";                          "\
"         open PREVIOUS,\"<$remotepartition\";                                 "\
'         $rev = "PREVIOUS met EOF if length<1024."; $rev=$rev.$rev;           '\
'         $rev=$rev.$rev.$rev.$rev; $rev=$rev.$rev.$rev.$rev;                  '\
'         while(read STDIN,$read,1)                                            '\
'         {                                                                    '\
'           if ($read eq "s")                                                  '\
'           {                                                                  '\
'             if (length($rev) eq 1024) { print REVP "s" } ;                   '\
'             $s++                                                             '\
'           } else {                                                           '\
'             if ($s) { seek STDOUT,$s*1024,1; seek PREVIOUS,$s*1024,1; $s=0}; '\
'             if (read PREVIOUS,$rev,1024) { print REVP "c".$rev };            '\
'             read STDIN,$buf,1024;                                            '\
'             print $buf                                                       '\
'           }                                                                  '\
"         }' 1<> $remotepartition                                              "

$rev is initialized to a scalar string of length 1024 (I don't know a better way to do this).

Without the formatting, and with more or die checks, this is:

patch=essai_delta.`date +'%Y%m%d.%H%M%S.%N'`.gz
ssh username@backupnas "perl -'MDigest::MD5 md5' -ne 'BEGIN{\$/=\1024};print md5(\$_)' essai_backup | gzip -c" | \
gunzip -c | LANG= tee >(wc -c|LANG= sed '1s%^%bin/backup_essai: number of 64 bytes blocs treated : %' >&2) | \
LANG= perl -'MDigest::MD5 md5' -e 'open DISK,"</data/data/com.spartacusrex.spartacuside/files/essai" or die $!; while( read DISK,$read,1024) { read STDIN,$md,16; if($md eq md5($read)) {print "s"} else {print "c" . $read } }' /data/data/com.spartacusrex.spartacuside/files/essai | \
gzip -c | \
ssh username@backupnas "LANG= tee -a $patch | gunzip -c | perl -e 'open REVP,\"| gzip -c > rev.$patch\" or die \$!; open READ,\"<essai_backup\" or die \$!; \$rev = \"if length<1024, EOF met in READ.\"; \$rev=\$rev.\$rev.\$rev.\$rev; \$rev=\$rev.\$rev.\$rev.\$rev; \$rev=\$rev.\$rev; while(read STDIN,\$read,1) { if (\$read eq \"s\") {if (length(\$rev) eq 1024) { print REVP \"s\" or die \$! } ; \$s++} else { if (\$s) { seek STDOUT,\$s*1024,1 or die \$!; seek READ,\$s*1024,1 or die \$!; \$s=0}; if (read READ,\$rev,1024) { print REVP \"c\".\$rev or die \$! } else { print STDERR \$!}; read STDIN,\$buf,1024 or die \$!; print \$buf  or die \$!} }' 1<> essai_backup"

To apply the forward or backward diff, I can use:

ssh username@backup_nas "LANG= cat diff_delta.20141202.110302.0935 | gunzip -c | perl -ne 'BEGIN{\$/=\1} if (\$_ eq\"s\") {\$s++} else {if (\$s) { seek STDOUT,\$s*1024,1; \$s=0}; read STDIN,\$buf,1024; print \$buf}' 1<> image.file"

So I managed to answer the first version of this post myself. This was tested on a 200k example file with some modifications.

I have specific questions about this code.

Why did the original example use read ARGV? Is it bad practice?

I have added many or die $! checks. Is that wise, or does it just destroy readability?

PREVIOUS and STDOUT are the same file opened twice (to avoid seek STDOUT,-1024,1). Is that considered good practice?

[question migrated manually from programmers.so]

`ssh username@backup_nas "LANG= tee -a $patch | gunzip -c | perl -ne 'BEGIN{\$/=\1} ; if (\$_ eq\"s\") {print ARGV \"s\"; \$s++} else {if (\$s) { seek STDOUT,\$s*1024,1; \$s=0}; read STDOUT,\$previous,1024; seek STDOUT,-1024,1; print ARGV \"c\" . \$previous; read STDIN,\$buf,1024; print \$buf}' 1<> essai_backup >(gzip -c > reverse.$patch)"` is all I could make and it does not work yet. — user2987828, Apr 07 '14 at 16:21
Your first snippet is a 10 command pipeline and most of it isn't perl and isn't directly generating or applying the diff but is managing it. Edit down your question to include just the perl that you are having a problem with and show us what you've done to try to make it work. Explain what you expect it to do vs what it is doing. — benrifkah, Apr 07 '14 at 17:14
@benrifkah, I tried more to make it work, and also succeeded in having a version that seems to work. The shell part is necessary because this is communication between two computers. The last perl part alone would be just the thing applying this particular format of diff, but I doubt it would be useful to anyone. — user2987828, Apr 08 '14 at 08:37
It sounds like you got it working. If so then you are [very welcome to submit your own answer.](https://stackoverflow.com/help/self-answer) However, before you do it would be a good idea to edit your question down to a single issue so that it is useful if others have the same problem in the future. — benrifkah, Apr 08 '14 at 18:48
Your questions about best practices and readability are better suited for [Code Review](http://codereview.stackexchange.com/). — ThisSuitIsBlackNot, Apr 08 '14 at 18:55
If one or two other answers of at least the same quality appear, I will award 200 rep to each of them as well. — user2987828, Apr 09 '14 at 10:26

Gene · Answer 1 · 2014-04-09T15:31:11.880

Why did the original example use read ARGV? Is it bad practice?

This is a religious question. For one-line SSH hacks like this, it's more-or-less fine if you and the people likely to be maintaining them are very good at perl idioms. Common wisdom though is that new perl code should use strict; and employ conventions that read more intuitively. The fact that you had to ask about bare ARGV and be referred to an obscure perlmonk article is exactly why. I'd look for an opportunity to distribute well-written, readable scripts to a standard place on the target machine and then run them remotely with simple ssh commands. On the other hand, the way above is great for job security.

I have added many or die $! checks. Is that wise, or does it just destroy readability?

It is always handy to know why a script died rather than get the obscure default error trace. The readability issue is only that you're using this broken technique of putting a fairly large script in an ssh command. As suggested above, if you set yourself up with a saner environment, adding or die $! will not hurt readability at all. It will enhance it by showing where you expect errors may occur.

PREVIOUS and STDOUT are the same file opened twice (to avoid seek STDOUT,-1024,1). Is that considered good practice?

Opening two descriptors on the same file in the same thread is not bad practice if the OS allows it, which most will. It's a little obscure, so it needs a comment. This is another thing you can do if you avoid in-line script.

What is really strange practice is the way the buffer for $rev is built up as a string by repeated concatenation to get 1024 characters. This is unnecessary. You can just say $rev = ''; and the string's length will be expanded automatically to the input size by read. If you really want to pre-allocate, just say $rev = '-' x 1024;.

Addition

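I just learned about a nice feature of bash. Its printf with the %q format specifier will add bash escapes to any string. With this, you could write escape-free bash and/or perl code in a file and then say

ssh $username@$backupnas "bash -c $(printf '%q' "$(cat script.bash)")"

The inner double quotes keep the whole script as a single argument to printf (unquoted, it would be split into words and the escaped pieces glued together without spaces). The bash -c makes the remote side run the escaped string as a script instead of looking it up as a command name. This assumes the remote login shell understands bash-style quoting.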

Thank you for your answer, it will likely get the 200 rep. PREVIOUS and STDOUT are both on the file $remotepartition, and STDIN is the uncompressed forward patch. I use PREVIOUS to read and STDOUT to write in the same file, but I don't know if it will break something. My first version used `read STDOUT,$rev,1024;seek STDOUT,-1024,1; print STDOUT $buf` instead of `read PREVIOUS,$rev,1024; print STDOUT $buf`. Thanks for the trick `'-' x 1024` (is it `x` or `*`?), and $rev indeed has to be initialized with length 1024. — user2987828, Apr 09 '14 at 09:59
ARGV and the perl code embedded in ssh are bad practice, ok. I am neither paid for that script nor a perl monk; this is my first perl code. I will still keep the perl code embedded, because then I do not need to maintain that perl script on each host harbouring a $remotepartition. scp might also send that perl code, but this would result in three files to maintain, and for other users to download. Here I have a dark habit: I prefer to keep in the same file any information that cannot be used without that file (e.g. I use subfunctions in Matlab). But I will now try to add `use strict;`. — user2987828, Apr 09 '14 at 10:11
I just decided to test and release two versions, the second with separate perl files. Did you, by chance, notice anything else that is bad practice (or that could easily be made more secure) in my code? — user2987828, Apr 09 '14 at 10:21
Thanks. In perl the repeat character operator is `x` rather than `*` as in some other languages. — Gene, Apr 09 '14 at 12:32
@user2987828 I added a note on the `%q` specifier for `printf` in `bash`, which I did not know about before. — Gene, Apr 09 '14 at 15:31
