If selected-images-to-copy.txt is a list of files only (the last element of each path is always a file, never a directory), here's a solution that creates the archive with the proper directory ownership and permissions:
EDIT: I added a better solution at the end while keeping the intermediate solutions around, building on dave_thompson_085's comments and on what could be improved with the information available.
As he wrote (and as I didn't fully explain), the important part of the solution is --no-recursion. It lets tar save the metadata of each directory listed explicitly along the path, down to the files themselves, without pulling in all the other unwanted directories and files that recursion would otherwise add.
awk -F/ '{ d=$1; for (i=2; i <= NF; i++) { print d; d=d "/" $i }; print d }' selected-images-to-copy.txt > selected-images-to-copy-with-explicit-arborescences.txt
tar cf - --no-recursion -T selected-images-to-copy-with-explicit-arborescences.txt | pigz | pv | nc 1.1.1.1 2222
If you really want to do it on the fly, using bash's <() construct:
tar cf - --no-recursion -T <(awk -F/ '{ d=$1; for (i=2; i <= NF; i++) { print d; d=d "/" $i }; print d }' selected-images-to-copy.txt) | pigz | pv | nc 1.1.1.1 2222
The awk command just reconstructs and adds the path, one directory level at a time up to the file itself.
That way every directory in the path of a file to save is also put in the archive, but with --no-recursion nothing else gets added. So the ownership of every directory leading to the file will be saved and restored correctly.
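As a quick illustration of what the awk step emits, here is a minimal sketch that feeds it a single path. The function name expand_paths and the sample path are only illustrative, not part of the original commands.

```shell
# Hypothetical helper wrapping the awk one-liner from above:
# prints every leading directory of each path, then the path itself.
expand_paths() {
  awk -F/ '{ d=$1; for (i=2; i <= NF; i++) { print d; d=d "/" $i }; print d }'
}

printf '%s\n' 'mnt/a/b/file2.png' | expand_paths
# mnt
# mnt/a
# mnt/a/b
# mnt/a/b/file2.png
```

Note that this assumes relative paths: a path starting with / has an empty first field, so the first line printed for it would be empty.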
There's still a performance trade-off to make somewhere: many directory prefixes will repeat, so the second tar (the extracting one) will often redo a chown on the same base directory. You could pipe the awk output through sort -u to remove all those duplicates, but then sort might take a very long time before producing any output, delaying the start of the transfer. A short perl script that keeps the unique entries in memory (the trade-off is memory usage, but I doubt that's a problem) outputs unique entries without delay and without any sorting. So the solution becomes:
tar cf - --no-recursion -T <(awk -F/ '{ d=$1; for (i=2; i <= NF; i++) { print d; d=d "/" $i }; print d }' selected-images-to-copy.txt | perl -w -e 'use strict; my %unique; while (<>) { if (not $unique{$_}++) { print } }' ) | pigz | pv | nc 1.1.1.1 2222
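The same order-preserving, in-memory de-duplication can also be sketched in awk with the common !seen[$0]++ idiom. This is only an alternative to the perl one-liner, with the same memory trade-off; the function name unique_unsorted and the sample lines are made up for the example.

```shell
# Hypothetical helper: print each line only the first time it appears,
# keeping input order (memory grows with the number of distinct lines).
unique_unsorted() {
  awk '!seen[$0]++'
}

printf '%s\n' mnt mnt/a mnt/a/f1.png mnt mnt/a mnt/a/f2.png | unique_unsorted
# mnt
# mnt/a
# mnt/a/f1.png
# mnt/a/f2.png
```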
EDIT: If the content of selected-images-to-copy.txt is more or less a sorted list of files (the unsorted output of a find [...] -type f kind of command is good enough), here's a solution that doesn't need to keep the entries in memory (which might indeed have become a problem with hundreds of millions of entries).
It is enough to remember the last longest path and compare it to the next path:
- either the next path is not a prefix of the previous one, meaning it's a new directory tree (or a new file in the same tree); it has to be archived and becomes the new "last longest path". If the initial list wasn't at least laid out as a tree (as in a find command output, or of course a sorted list), some benign repetitions will appear.
- or it is a prefix (a substring matching from the 1st character), meaning it's a directory that was already seen, since it's part of the previous path, and it can be safely ignored.
I'm adding a trailing / in the comparison so that mnt/a/b/foo/ is easily found not to be a prefix of mnt/a/b/foobar. With mnt/a/b/foobar/file4.png and mnt/a/b/foo/file5.png as input, the ownership of the directory mnt/a/b/foo wouldn't have been restored without this trick. So the perl command is replaced with:
awk '{ if (index(old,$0 "/") != 1) { old=$0; print } }'
This sample:
file1.png
mnt/a/b/file2.png
mnt/a/b/file3.png
mnt/a/b/c/foobar/file4.png
mnt/a/b/c/foo/file5.png
mnt/a/b/file6.png
mnt/a/b/d/file7.png
Through this filter:
awk -F/ '{ d=$1; for (i=2; i <= NF; i++) { print d; d=d "/" $i }; print d }' | awk '{ if (index(old,$0 "/") != 1) { old=$0; print } }'
Gives these directories and files, ready for tar --no-recursion:
file1.png
mnt
mnt/a
mnt/a/b
mnt/a/b/file2.png
mnt/a/b/file3.png
mnt/a/b/c
mnt/a/b/c/foobar
mnt/a/b/c/foobar/file4.png
mnt/a/b/c/foo
mnt/a/b/c/foo/file5.png
mnt/a/b/file6.png
mnt/a/b/d
mnt/a/b/d/file7.png
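To see why the trailing / matters, here is a small sketch comparing the filter with a naive variant that omits it. The function names with_slash, without_slash and sample are made up for the comparison, and the input is the already expanded list for mnt/a/b/foobar/file4.png followed by mnt/a/b/foo/file5.png.

```shell
# Filter from the answer: the candidate gets a trailing "/" before the prefix test.
with_slash() { awk '{ if (index(old,$0 "/") != 1) { old=$0; print } }'; }
# Naive variant, shown only for comparison: plain prefix test without the "/".
without_slash() { awk '{ if (index(old,$0) != 1) { old=$0; print } }'; }

# Expanded paths for mnt/a/b/foobar/file4.png, then mnt/a/b/foo/file5.png.
sample() {
  printf '%s\n' mnt mnt/a mnt/a/b mnt/a/b/foobar mnt/a/b/foobar/file4.png \
                mnt mnt/a mnt/a/b mnt/a/b/foo mnt/a/b/foo/file5.png
}

echo 'with trailing /:';    sample | with_slash    | tail -n 2
# mnt/a/b/foo
# mnt/a/b/foo/file5.png
echo 'without trailing /:'; sample | without_slash | tail -n 2
# mnt/a/b/foobar/file4.png
# mnt/a/b/foo/file5.png
```

In the naive variant mnt/a/b/foo is mistaken for a prefix of mnt/a/b/foobar/file4.png and dropped, so that directory never reaches tar.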
So the solution with the whole pair of commands becomes (root already uses -p and --same-owner, and it's better to drop bash's fancy <() when a | works and makes it easy to break the long line with \ for readability):
# TARGET (extract):
$ nc -l -p 2222 | pigz -d | sudo tar xf - -C /
# SOURCE:
$ awk -F/ '{ d=$1; for (i=2; i <= NF; i++) { print d; d=d "/" $i }; print d }' selected-images-to-copy.txt | \
awk '{ if (index(old,$0 "/") != 1) { old=$0; print } }' | \
tar cf - --no-recursion -T - | pigz | pv | nc -w 60 1.1.1.1 2222
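Before sending anything over the network, the SOURCE pipeline can be tried locally. The following is only a sketch under assumptions: it builds a throwaway tree in a temporary directory (the names work, unwanted.png, also-unwanted.png and dry-run.tar are invented for the example), writes the archive to a local file instead of nc, and assumes a tar that accepts --no-recursion and -T - as used above.

```shell
# Hypothetical scratch tree, only for a local dry run (no network involved).
work=$(mktemp -d)
(
  cd "$work" || exit 1
  mkdir -p mnt/a/b/c
  touch mnt/a/b/file2.png mnt/a/b/unwanted.png mnt/a/b/c/also-unwanted.png
  printf '%s\n' 'mnt/a/b/file2.png' > selected-images-to-copy.txt

  # Same SOURCE pipeline, but archiving to a local file instead of piping to nc.
  awk -F/ '{ d=$1; for (i=2; i <= NF; i++) { print d; d=d "/" $i }; print d }' selected-images-to-copy.txt | \
  awk '{ if (index(old,$0 "/") != 1) { old=$0; print } }' | \
  tar cf dry-run.tar --no-recursion -T -
)

# List the archive: it should hold mnt, mnt/a, mnt/a/b and the selected file,
# and neither of the unwanted files.
tar tf "$work/dry-run.tar"
```

Listing with tar tvf instead also shows the owner and mode recorded for each directory, which is what the extracting side will restore.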