
I want to back up my system by invoking a tar'ing script over ssh that pipes the archive back over stdout, so the ssh-initiating host can store the tar.

However, I also want to perform logical dumps of some services running on that host, but there is not enough disk space there to write these huge dump files to disk first and then have tar pick them up.

I do know that tar cannot handle streams (or any file whose size is not known in advance). So I figured I would split the dumps into fixed-size chunks while they run, store each chunk on disk temporarily, hand it to tar for processing, and then delete it before producing the next chunk.

My script for this looks something like:

mkfifo filenames
tar --files-from filenames -cf - &
TAR_PID=$!
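# keep a write end of the fifo open for the whole script, so tar
# doesn't see EOF on the filename list between file names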
exec 100>filenames

# tar all relevant host-level directories/files
echo "/etc" >&100
echo "/root" >&100

function splitfilter() {
  tee "$1"

  (
    # wait for tar to finish reading the file and delete it after being processed
    inotifywait -e close_nowrite "$1"
    rm "$1"
  ) &
  RM_SHELL_PID=$!

  # send the filename for processing to tar
  echo "$1" >&100

  wait $RM_SHELL_PID
}
export -f splitfilter

# perform the logical dumps of my services
dump_program_1 | split -b 4K - /var/backup/PREFIX_DUMP_1_ --filter 'splitfilter $FILE'
dump_program_2 | split -b 4K - /var/backup/PREFIX_DUMP_2_ --filter 'splitfilter $FILE'

exec 100>&-
wait $TAR_PID
rm filenames

However, I cannot figure out why this works only intermittently. I have observed two distinct failure behaviours so far:

  • tar not stopping. At the end of the script I do close the file descriptor, so I expect the fifo to signal EOF to tar. This should end the tar process rather quickly, as it only needs to finish processing the last 4K chunk (if it hasn't already). I cannot explain why it randomly hangs. The resulting archive is actually complete (except for tar's EOF marker)...
  • tar processing 0-byte files. After some time of processing, it seems inotifywait wakes up before tar has closed the chunk file for reading, so the file gets deleted and shows up as a 0-byte entry in the archive. I have mitigated this somewhat by putting a sleep 1 after the echo "$1" >&100 call. After that, the first couple of chunks do get filled, but after some time running, the later chunks become 0-sized again. I sense a timing problem here somewhere, but can't see it currently.

After a day of debugging I am losing hope in this approach, but it would be SO good if it worked reliably: it could actually produce streamed tars! Don't get me wrong: it worked once or twice while debugging. I just cannot figure out why it doesn't work every time.

Thysce
  • if you want the list of files to be read from stdin, then use cpio instead of tar – Udi Jul 23 '23 at 12:51
  • Sadly, I cannot change the archive format, since I need to stream it directly to borg backup, and that only supports reading from tar – Thysce Jul 23 '23 at 12:57

1 Answer


The tar format is fairly simple. We can stream it ourselves with this TXR Lisp program.

Caveat: this doesn't handle long paths (anything beyond the 100-character name field); it puts out only one header block per object.

The backup list consists of a mixture of paths and command entries.

Commands are executed, and their output is chopped into 4K pieces which become numbered files. These are deleted as we go, so nothing accumulates on disk.

Even when we write our own implementation of tar, we still have to go through temporary files, because the format requires the size of every object to be known in advance and placed into the header. There is no way to stream command output of unknown length directly into a tar entry.

(defvar backup-list
  '("/etc"
    "/root"
    (:cmd "cat /proc/cpuinfo" "cpuinfo")
    (:cmd "lspci" "lspci")))

(defsymacro splsize 4096) ;; split size for commands

(defsymacro blsize 512) ;; tar block size: written in stone

(typedef tarheader
  (struct tarheader
    (name (array 100 zchar))
    (mode (array 8 zchar))
    (uid (array 8 zchar))
    (gid (array 8 zchar))
    (size (array 12 zchar))
    (mtime (array 12 zchar))
    (chksum (array 8 char))
    (typeflag char)
    (linkname (array 100 zchar))
    (magic (array 6 char))
    (version (array 2 char))
    (uname (array 32 zchar))
    (gname (array 32 zchar))
    (devmajor (array 8 zchar))
    (devminor (array 8 zchar))
    (prefix (array 155 zchar))))
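
;; (The fields above total 500 bytes: the classic ustar header
;; layout, zero-padded to a full 512-byte block when written out.)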

(defmacro octfill (slot expr)
  ^(fmt "~,0*o" (pred (sizeof tarheader.,slot)) ,expr))
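
;; e.g. the 8-byte mode field gets seven zero-padded octal digits
;; plus a terminating NUL: (octfill mode #o644) produces "0000644".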

;; Dump an object into the archive.
;; Form a correct header, calculate the checksum,
;; put out a header block and for regular files,
;; put out data blocks.
(defun tar-dump-object (file-in stream : stat)
  (let* ((file (trim-path-seps file-in))
         (s (or stat (stat file)))
         (tf (ecaseql* (logand s.mode s-ifmt)
               (s-ifreg #\0)
               (s-iflnk #\2)
               (s-ifchr #\3)
               (s-ifblk #\4)
               (s-ifdir #\5)
               (s-ififo #\6)))
         (h (new tarheader
                 name (let* ((n (cond
                                  ((equal "/" file) ".")
                                  ((starts-with "/" file) [file 1..:])
                                  (t file))))
                        (if (eql tf #\5) `@n/` n))
                 mode (octfill mode (logand s.mode #o777))
                 uid (octfill uid s.uid)
                 gid (octfill gid s.gid)
                 size (octfill size (if (eql tf #\0) s.size 0))
                 mtime (octfill mtime s.mtime)
                 chksum (load-time (str 8))
                 typeflag tf
                 linkname (if (eql tf #\2) (readlink file) "")
                 magic "ustar "
                 version " "
                 uname (or (getpwuid s.uid).?name "")
                 gname (or (getgrgid s.gid).?name "")
                 devmajor (if (meql tf #\3 #\4)
                            (octfill devmajor (major s.rdev)) "")
                 devminor (if (meql tf #\3 #\4)
                            (octfill devminor (minor s.rdev)) "")
                 prefix ""))
         (hb (ffi-put h (ffi tarheader)))
         (ck (logand (sum hb) #x1FFFF))
         (bl (make-buf blsize))
         (nb (trunc (+ s.size blsize -1) blsize)))
    (set h.chksum (fmt "~,06o\xdc00 " ck))
    (ffi-put-into bl h (ffi tarheader))
    (put-buf bl 0 stream)
    (if (eql tf #\0)
      (with-stream (in (open-file file "rb"))
        (each ((i 0..nb))
          (fill-buf-adjust bl 0 in)
          (buf-set-length bl blsize)
          (put-buf bl 0 stream))))))

;; Output two zero-filled blocks to terminate archive.
(defun tar-finish-archive (: (stream *stdout*))
  (let ((bl (make-buf (* 2 blsize))))
    (put-buf bl 0 stream)))

;; Dump an object into the archive, recursing
;; if it is a directory.
(defun tar-dump-recursive (path : (stream *stdout*))
  (ftw path (lambda (path type stat . rest)
              (tar-dump-object path stream stat))))

;; Dump a command to the archive by capturing the
;; output into numbered temporary split files.
(defun tar-dump-command (command prefix : (stream *stdout*))
  (let ((bl (make-buf splsize))
        (i 0))
    (with-stream (s (open-command command "rb"))
      (while (plusp (fill-buf-adjust bl 0 s))
        (let ((name (pic `@{prefix}0###` (inc i))))
          (file-put-buf name bl)
          (tar-dump-object name stream)
          (remove-path name))))))

;; main: process the backup list to stream out the archive
;; on standard output, then terminate it.
(each ((item backup-list))
  (match-ecase item
    ((:cmd @cmd @prefix) (tar-dump-command cmd prefix))
    (`@file` (tar-dump-recursive file))))

(tar-finish-archive)
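
Assuming the program is saved on the remote host as, say, stream-tar.tl (the file name and host below are made up), it can stand in for the tar invocation in the question's ssh pipeline:

ssh backuphost txr stream-tar.tl > host-backup.tar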

I don't have a regression test suite for this; I tested it manually by archiving individual objects of various kinds and comparing hex dumps against GNU tar's output, and then by unpacking directory trees archived with this implementation and running recursive diffs against the original trees.

However, I wonder whether the backup service you are using might in fact handle catenated archives. If it does, then you can just use multiple invocations of tar to produce the stream, and not have all these process coordination issues.

For a tar consumer to handle catenated archives, it just has to ignore all-zero blocks (not treat them as the end of the archive), but keep reading.
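
GNU tar itself behaves this way when given the --ignore-zeros option, which also makes for an easy way to test a catenated stream:

(tar cf - /etc; tar cf - /root) | tar --ignore-zeros -tvf -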

If the backup service is like this, then you can basically do it along these lines (note that split's --filter pipes each chunk to the command on its stdin and merely sets $FILE to the name the chunk would have had, so the filter must write the file itself):

(tar cf - /etc
 tar cf - /root
 dump_program_1 | \
   split -b 4K - /var/backup/PREFIX_DUMP_1_ \
         --filter 'cat > "$FILE"; tar cf - "$FILE"; rm "$FILE"'
 ...) | ... into backup service ...

I can't see any option in GNU Tar not to write the terminating zeros. It might be possible to write a filter to get rid of these:

tar cf - file | remove-zero-blocks

The not-yet-written remove-zero-blocks filter reads 512-byte blocks through a block-oriented FIFO that is long enough to cover the blocking factor used by tar. It places each newly read block into one end of the FIFO, and writes out the oldest block bumped from the other end. When EOF is encountered, the FIFO is flushed, but all trailing 512-byte blocks that are zero are omitted.

That should defeat a backup service that refuses to ignore zero blocks.
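
Here is a minimal shell sketch of that idea (a real implementation would be a small C program, since spawning dd and cmp per 512-byte block is slow; this just demonstrates the logic). Instead of a fixed-length FIFO, it counts held-back zero blocks and flushes them only when a data block follows, which has the same effect. iflag=fullblock is a GNU dd extension:

#!/bin/sh
# remove-zero-blocks (sketch): copy stdin to stdout in 512-byte
# blocks, dropping any run of all-zero blocks at the end of input.
tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"' EXIT
pending=0   # number of zero blocks held back so far
while :; do
  # read exactly one block; iflag=fullblock avoids short reads on pipes
  dd bs=512 count=1 iflag=fullblock of="$tmp" 2>/dev/null
  n=$(wc -c < "$tmp")
  [ "$n" -eq 0 ] && break   # EOF: held-back zero blocks are dropped
  if [ "$n" -eq 512 ] && head -c 512 /dev/zero | cmp -s "$tmp" -; then
    pending=$((pending + 1))   # may be trailing padding; hold it back
  else
    # a data block arrived: flush any held-back zero blocks first
    [ "$pending" -gt 0 ] && head -c $((pending * 512)) /dev/zero
    pending=0
    cat "$tmp"
  fi
done

Note that GNU tar pads the archive out to its record size, so there can be many trailing zero blocks, and after filtering, the catenated stream carries no terminator at all; a single terminator (1024 zero bytes) can be appended once at the end. Assuming the script is installed as remove-zero-blocks:

(tar cf - /etc | remove-zero-blocks
 tar cf - /root | remove-zero-blocks
 head -c 1024 /dev/zero) | ... into backup service ...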

Kaz