
I want to speed up my backup, which is done with tar czf, the common way to do it. But day by day the files I back up grow, so it becomes slower.

I was thinking of taking advantage of the several cores available in my server, and I was wondering whether there is any difference between doing the backup with tar czf and piping tar to gzip: tar cf - | gzip
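Concretely, the two forms I am comparing look roughly like this (paths and file names are placeholders):

tar czf backup.tar.gz /path/to/data
tar cf - /path/to/data | gzip > backup.tar.gz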

I guess there isn't any difference, because the first form also spawns two processes (tar and gzip), much like the explicit pipe does.

If there is no difference, do you know of any good alternative for doing this without going incremental? I'm looking at pigz too and it looks fine.

Dario Castañé
  • Can someone with enough points edit this question? Using tar -cfz causes errors but tar -czf doesn't. – Eugene M Aug 27 '10 at 11:32

3 Answers


When you say you want to take advantage of multiple cores, the implication is that your tar with gzip is CPU bound and not IO bound. Are you sure this is the case? If you are not sure, you need to run sar, iostat, or top, or check monitoring graphs, etc. to find out. It is never a good idea to try to solve a problem without understanding it first. I'm not saying this is the case for you for sure, but my guess would be that even though there is compression with gzip, you are more likely to be IO bound.
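For example (assuming the sysstat package is available for iostat and sar), you could watch the system in another terminal while the backup runs:

iostat -x 5
top

If gzip sits near 100% of one core in top while the disks show low utilisation in iostat, you are CPU bound; if the disks are pegged and gzip is mostly waiting, you are IO bound.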

If it is IO bound, and you have multiple arrays, a separate process for each array might make sense.

I also second David's advice to consider incremental.

Kyle Brandt
  • For me it is both CPU and IO bound. I tried parallel bzip2 on my local machine and it zipped a huge file in a flash, so I wondered whether a similar approach could be applied to the servers. Anyway, I'm a limited user on those servers; I admin the application that runs on them, not the OS, so I am not able to make changes without going through "bureaucracy". Finally, the servers are under high load (it is a huge system). Thanks for your comment, I didn't realize that I have bzip2 available. Maybe I can run a test with it. – Dario Castañé Mar 13 '10 at 09:07

You're unlikely to improve on the raw performance of tar and gzip by fiddling like this; in order to take better advantage of the hardware you could separate out folders into different parts and do multiple archives simultaneously.
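As a rough sketch (the directory names are just placeholders), something like:

tar czf part1.tar.gz /data/dir1 &
tar czf part2.tar.gz /data/dir2 &
wait

gives each archive its own gzip process, so two cores get used instead of one.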

Why do you not want to go incremental? I would recommend using rsnapshot even if you're doing this locally, as it can use hard links to save disk space while still keeping exact copies from multiple points in time.
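To illustrate the hard-link idea without committing to rsnapshot, what it does under the hood is roughly this (paths are placeholders):

rsync -a --delete --link-dest=/backups/daily.1 /data/ /backups/daily.0/

Files that haven't changed since the previous snapshot in daily.1 become hard links in daily.0, so each extra snapshot only costs the space of the files that actually changed.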

David Fraser
  • I am not able to go incremental because I am a limited user on those systems; I don't admin them beyond the application layer. Your comment and Kyle's made me think about using bzip2 and about trying a "kind-of" incremental backup, backing up only the changed files every day, with a full backup at the weekend. Thanks! I will answer the question when I have tried both options. – Dario Castañé Mar 13 '10 at 09:10
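A rough sketch of that day-by-day approach mentioned in the comment above (paths and archive names are placeholders, and it assumes GNU tar and find):

find /data -type f -mtime -1 -print0 | tar czf daily-$(date +%F).tar.gz --null -T -

This archives only the files modified in the last 24 hours, leaving the ordinary tar czf full.tar.gz /data run for the weekend.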

If you are CPU bound (and not IO bound!), you can use pigz. It will spread gzip over multiple cores. I use it for my backups. It's a drop-in replacement for gzip:

tar cf - /path/to/data | pigz > backup.tar.gz
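The path and archive name above are placeholders. pigz also takes a -p option to limit the number of threads it uses, which can be handy on a shared server; for example:

tar cf - /path/to/data | pigz -p 4 > backup.tar.gz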
Steven