The tar format is fairly simple. We can stream it ourselves with this TXR Lisp program.
Caveat: this doesn't handle long paths; it puts out only one header block per object, so names are limited to the 100 bytes of the header's name field.
The backup list consists of a mixture of paths and command entries.
Commands are executed, and their output is chopped into 4 kB pieces which become numbered temporary files. These are deleted as we go, so nothing accumulates.
Even though we write our own implementation of tar, we still have to do this, because the format requires the size of every object to be known in advance and placed into its header. There is no way to stream command output of unknown length as a single tar member.
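This constraint is not specific to the program below. For instance, Python's standard tarfile module also requires a member's size before its header can be written; this illustrative snippet (the member name is made up) writes one in-memory member:

```python
import io, tarfile

data = b"output of some command\n"

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    info = tarfile.TarInfo(name="cpuinfo001")
    info.size = len(data)        # the size must go into the header up front
    tar.addfile(info, io.BytesIO(data))
```

Capturing command output to a temporary file first (as the program below does) is what makes the size known.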
(defvar backup-list
  '("/etc"
    "/root"
    (:cmd "cat /proc/cpuinfo" "cpuinfo")
    (:cmd "lspci" "lspci")))
(defsymacro splsize 4096) ;; split size for commands
(defsymacro blsize 512) ;; tar block size: written in stone
(typedef tarheader
  (struct tarheader
    (name (array 100 zchar))
    (mode (array 8 zchar))
    (uid (array 8 zchar))
    (gid (array 8 zchar))
    (size (array 12 zchar))
    (mtime (array 12 zchar))
    (chksum (array 8 char))
    (typeflag char)
    (linkname (array 100 zchar))
    (magic (array 6 char))
    (version (array 2 char))
    (uname (array 32 zchar))
    (gname (array 32 zchar))
    (devmajor (array 8 zchar))
    (devminor (array 8 zchar))
    (prefix (array 155 zchar))))
(defmacro octfill (slot expr)
  ^(fmt "~,0*o" (pred (sizeof tarheader.,slot)) ,expr))
;; Dump an object into the archive.
;; Form a correct header, calculate the checksum,
;; put out a header block and for regular files,
;; put out data blocks.
(defun tar-dump-object (file-in stream : stat)
  (let* ((file (trim-path-seps file-in))
         (s (or stat (stat file)))
         (tf (ecaseql* (logand s.mode s-ifmt)
               (s-ifreg #\0)
               (s-iflnk #\2)
               (s-ifchr #\3)
               (s-ifblk #\4)
               (s-ifdir #\5)
               (s-ififo #\6)))
         (h (new tarheader
              name (let* ((n (cond
                               ((equal "/" file) ".")
                               ((starts-with "/" file) [file 1..:])
                               (t file))))
                     (if (eql tf #\5) `@n/` n))
              mode (octfill mode (logand s.mode #o777))
              uid (octfill uid s.uid)
              gid (octfill gid s.gid)
              size (octfill size (if (eql tf #\0) s.size 0))
              mtime (octfill mtime s.mtime)
              chksum (load-time (str 8))
              typeflag tf
              linkname (if (eql tf #\2) (readlink file) "")
              magic "ustar "
              version " "
              uname (or (getpwuid s.uid).?name "")
              gname (or (getgrgid s.gid).?name "")
              devmajor (if (meql tf #\3 #\4)
                         (octfill devmajor (major s.rdev)) "")
              devminor (if (meql tf #\3 #\4)
                         (octfill devminor (minor s.rdev)) "")
              prefix ""))
         (hb (ffi-put h (ffi tarheader)))
         (ck (logand (sum hb) #x1FFFF))
         (bl (make-buf blsize))
         (nb (trunc (+ s.size blsize -1) blsize)))
    (set h.chksum (fmt "~,06o\xdc00 " ck))
    (ffi-put-into bl h (ffi tarheader))
    (put-buf bl 0 stream)
    (if (eql tf #\0)
      (with-stream (in (open-file file "rb"))
        (each ((i 0..nb))
          (fill-buf-adjust bl 0 in)
          (buf-set-length bl blsize)
          (put-buf bl 0 stream))))))
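The checksum handling above follows the usual tar rule: the chksum field is treated as eight ASCII spaces while the 512 header bytes are summed, and the result is stored as six octal digits, a NUL, and a space. As a sanity check of that rule (not part of the program), we can recompute the checksum of a header produced by Python's tarfile module; the member name here is arbitrary:

```python
import tarfile

info = tarfile.TarInfo(name="etc/hostname")
header = info.tobuf(format=tarfile.USTAR_FORMAT)   # one 512-byte block

# Sum all 512 bytes, with the 8-byte chksum field (offset 148)
# counted as ASCII spaces.
ck = sum(header[:148]) + 8 * ord(" ") + sum(header[156:512])

# Stored form: six octal digits, NUL, space.
assert header[148:156] == b"%06o\0 " % ck
```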
;; Output two zero-filled blocks to terminate archive.
(defun tar-finish-archive (: (stream *stdout*))
  (let ((bl (make-buf (* 2 blsize))))
    (put-buf bl 0 stream)))
;; Dump an object into the archive, recursing
;; if it is a directory.
(defun tar-dump-recursive (path : (stream *stdout*))
  (ftw path (lambda (path type stat . rest)
              (tar-dump-object path stream stat))))
;; Dump a command to the archive by capturing the
;; output into numbered temporary split files.
(defun tar-dump-command (command prefix : (stream *stdout*))
  (let ((bl (make-buf splsize))
        (i 0))
    (with-stream (s (open-command command "rb"))
      (while (plusp (fill-buf-adjust bl 0 s))
        (let ((name (pic `@{prefix}0###` (inc i))))
          (file-put-buf name bl)
          (tar-dump-object name stream)
          (remove-path name))))))
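The same capture-split-archive loop can be sketched in Python for comparison; this is an illustration, not a translation of the program, and the numbering format is approximate:

```python
import io, subprocess, tarfile

SPLSIZE = 4096  # split size for commands, matching splsize above

def tar_dump_command(command, prefix, tar):
    """Run a shell command, chop its output into SPLSIZE-byte pieces,
    and add each piece to an open tar archive as a numbered member."""
    i = 0
    with subprocess.Popen(command, shell=True,
                          stdout=subprocess.PIPE) as proc:
        while True:
            chunk = proc.stdout.read(SPLSIZE)
            if not chunk:
                break
            i += 1
            info = tarfile.TarInfo(name="%s%04d" % (prefix, i))
            info.size = len(chunk)   # size is known: the chunk is in memory
            tar.addfile(info, io.BytesIO(chunk))

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    tar_dump_command("head -c 10000 /dev/zero", "zeros_", tar)
```

Because each piece is fully buffered before its header is written, the unknown total length of the command's output never has to appear in any header.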
;; main: process the backup list to stream out the archive
;; on standard output, then terminate it.
(each ((item backup-list))
  (match-ecase item
    ((:cmd @cmd @prefix) (tar-dump-command cmd prefix))
    (`@file` (tar-dump-recursive file))))

(tar-finish-archive)
I don't have a regression test suite for this. I tested it manually, by archiving individual objects of various kinds and comparing hex dumps against GNU tar's output, and by unpacking directory trees archived by this implementation and running recursive diffs against the original trees.
However, I wonder whether the backup service you are using might in fact handle catenated archives. If it does, then you can just use multiple invocations of tar to produce the stream, and avoid all these process coordination issues.
For a tar consumer to handle catenated archives, it just has to ignore runs of all-zero blocks (rather than treating them as the end of the archive) and keep reading.
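GNU tar's read-side option for this is -i (--ignore-zeros), and Python's tarfile module exposes the same behavior, which this sketch (with made-up member names) uses to show the difference between a strict and a lenient consumer:

```python
import io, tarfile

def one_member_archive(name, data):
    """Build a complete tar archive, including its terminating
    zero blocks, containing a single member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

catenated = one_member_archive("a.txt", b"aaa") + \
            one_member_archive("b.txt", b"bbb")

# A strict reader stops at the first run of zero blocks ...
with tarfile.open(fileobj=io.BytesIO(catenated)) as tar:
    strict_names = tar.getnames()

# ... while an ignore-zeros reader (like GNU tar -i) keeps going.
with tarfile.open(fileobj=io.BytesIO(catenated),
                  ignore_zeros=True) as tar:
    lenient_names = tar.getnames()
```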
If the backup service is like this, then you can basically do it along these lines:
(tar cf - /etc
 tar cf - /root
 dump_program_1 | \
   split -b 4K --filter 'cat > "$FILE"; tar cf - "$FILE"; rm "$FILE"' \
         - /var/backup/PREFIX_DUMP_1_
 ...) | ... into backup service ...
I can't see any option in GNU tar to suppress the terminating zero blocks. It might be possible to get rid of them with a filter:
tar cf - file | remove-zero-blocks
The not-yet-written remove-zero-blocks filter reads 512-byte blocks through a block-oriented FIFO that is long enough to cover the blocking factor used by tar. It places each newly read block into one end of the FIFO and writes out the oldest block bumped from the other end. When EOF is encountered, the FIFO is flushed, omitting all trailing 512-byte blocks that are zero.
That should defeat a backup service that refuses to ignore zero blocks.
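Such a filter is easy to sketch. Rather than a fixed-length FIFO, this Python illustration holds back runs of zero blocks and drops whichever run is still pending at EOF, which has the same effect; wire it to sys.stdin.buffer and sys.stdout.buffer to use it in a pipeline:

```python
import io

BLSIZE = 512
ZERO_BLOCK = bytes(BLSIZE)

def remove_zero_blocks(instream, outstream):
    """Copy a stream of 512-byte blocks, holding back runs of all-zero
    blocks; a run is emitted only when a nonzero block follows it, so
    trailing zero blocks (tar's end-of-archive marker) are dropped."""
    pending = 0                      # zero blocks not yet written
    while True:
        block = instream.read(BLSIZE)
        if not block:
            break                    # EOF: pending zero blocks are dropped
        if len(block) < BLSIZE:      # pad a final short read to block size
            block = block.ljust(BLSIZE, b"\0")
        if block == ZERO_BLOCK:
            pending += 1
        else:
            outstream.write(ZERO_BLOCK * pending)
            pending = 0
            outstream.write(block)

# In-memory demonstration: an interior zero block survives,
# the trailing ones do not.
archive = b"A" * 512 + bytes(512) + b"B" * 512
out = io.BytesIO()
remove_zero_blocks(io.BytesIO(archive + bytes(1024)), out)
stripped = out.getvalue()
```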