How can I extract the size of the total uncompressed file data in a .tar.gz file from command line?
7 Answers
This works for any file size:
zcat archive.tar.gz | wc -c
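Note that on some BSD-derived systems (macOS included), plain `zcat` expects compress(1)-style `.Z` files; there you can use `gzcat` instead, or the long form that works everywhere gzip does:
gzip -dc archive.tar.gz | wc -c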
For files smaller than 4 GB, you can also use the -l option with gzip:
$ gzip -l compressed.tar.gz
         compressed        uncompressed  ratio uncompressed_name
                132               10240  99.1% compressed.tar

- This gives me the size of the tar file including file metadata such as file names, etc. I was looking for a way to check only the total size of the files. Anyway, the only way to do this seems to be to extract the tar file and run a script on the extracted content. – Ztyx May 01 '10 at 11:46
- Actually, this could be enough. You will also need space for folder inodes, which can vary between filesystems. Also note that counting the real size with `tar -tf...` **will run gzip -d** on the full file, so you actually decompress the whole tar. The **gzip -l** shown here does not decompress, so it is quite fast. – Vadim Fint Nov 14 '12 at 11:01
- In my case, this gives me an uncompressed size which is smaller than the compressed size, and a negative ratio. – lefterav Feb 27 '14 at 14:01
- Worth noting that the uncompressed size reported is modulo 2^32, which means this doesn't work for files greater than 4 GB. Use this command instead: `zcat archive.tar.gz | wc -c` – nedned Mar 19 '14 at 01:30
- Thanks @nedned. I was wondering why a 2.9 GB tar.gz full of text data files was reporting a -36% compression ratio o_O. That seems like a silly bug. – naught101 Mar 25 '19 at 22:05
- @naught101 It's a file format limitation and documented in the man page: "The gzip format represents the input size modulo 2^32" – sehe Dec 03 '20 at 23:32
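For the curious: `gzip -l` reads that figure from the ISIZE field, the last four bytes of the gzip stream, which hold the original input size modulo 2^32 (RFC 1952). A minimal sketch that reads the field directly, assuming a single-member gzip file and a little-endian machine (so `od` sees the bytes in their stored order):
$ tail -c 4 archive.tar.gz | od -An -tu4
Being a 32-bit field, it wraps around at 4 GB, which is exactly the limitation discussed above.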
This will sum the total content size of the extracted files:
$ tar tzvf archive.tar.gz | sed 's/ \+/ /g' | cut -f3 -d' ' | sed '2,$s/^/+ /' | paste -sd' ' | bc
The output is given in bytes.
Explanation: `tar tzvf` lists the files in the archive in verbose format like `ls -l`. `sed` and `cut` isolate the file size field. The second `sed` puts a `+` in front of every size except the first, and `paste` concatenates them, giving a sum expression that is then evaluated by `bc`.
Note that this doesn't include metadata, so the disk space taken up by the files when you extract them is going to be larger - potentially many times larger if you have a lot of very small files.
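If you want the total of the file contents alone without parsing the listing, another sketch is to use tar's `-O`/`--to-stdout` flag (supported by GNU tar and bsdtar), which writes every extracted file's data to stdout so `wc` can count it. This decompresses the whole archive, so it is about as slow as `zcat | wc -c`, but it counts only file data, with no tar headers or padding:
$ tar -xzOf archive.tar.gz | wc -c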

- Or a bit more concisely: `tar tzvf archive.tar.gz | awk '{s+=$3} END{print (s/1024/1024), MB}'`. – Rubens Mar 18 '14 at 02:17
- Thanks, Rubens. This is perfect and simple. I did this for mine and it worked great: `tar tzvf 20180731.tar.gz | awk '{s+=$3} END{print (s/1024/1024/1024) " GB"}'`. I did have to put quotes around "MB" or "GB" to get that printed. – Tony B Aug 01 '18 at 20:31
- Calculate top-level directory (and file) sizes: `tar tzvf /tmp/root.tgz | sed 's/ \+/ /g' | cut -f3,6- -d' ' | cut -f1 -d'/' | awk '{ arr[$2]+=$1 } END { for (key in arr) printf("%s\t%s\n", key, arr[key]) }'` – Ilya Sheershoff Oct 05 '18 at 12:52
- I saw sizes like 0,0, which breaks the pipe. Adding an additional `sed 's/,/./g'` helps. This replaces the comma with a dot, and then the summing up works. – falkb Nov 05 '21 at 11:12
- @Rubens that is the best answer. The OP wants to know the size of the files ACCORDING to tar, not once you extract it, because the archive can be defective: `tar: Unexpected EOF in archive` – Smeterlink Aug 30 '22 at 21:18
The command `gzip -l archive.tar.gz` doesn't work correctly with file sizes greater than 2 GB. I would recommend `zcat archive.tar.gz | wc --bytes` instead for really large files.
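`wc --bytes` is the GNU long form of `wc -c`; if your `wc` lacks long options (BSD/macOS, for example), the portable spelling of the same pipeline is:
$ zcat archive.tar.gz | wc -c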

- I believe `gzip -l` doesn't work with file sizes greater than **4GB**, since gzip only uses 4 bytes to store the original file size. – kevin Mar 15 '15 at 09:10
- Looking at the source for gzip.c, it appears to be an off_t, which is a signed 4-byte value, so the max is 2GB. – swdev Mar 16 '15 at 18:24
- The gzip specification (https://www.ietf.org/rfc/rfc1952.txt) says the ISIZE field should be the original file size modulo 2^32; not sure why gzip uses a signed int... – kevin Mar 16 '15 at 19:11
- Listing files greater than 4 GiB was fixed in gzip 1.12 (2022-04), [release notes](https://lists.gnu.org/archive/html/info-gnu/2022-04/msg00003.html). – Fofola Aug 26 '22 at 15:12
I know this is an old question, but I wrote a tool just for this two years ago. It's called gzsize and it gives you the uncompressed size of a gzipped file without actually decompressing the whole file on disk:
$ gzsize <your file>

- What does it improve over piping to `wc`? Piping also works on-the-fly, I think. – mxmlnkn Feb 04 '19 at 13:27
- @mxmlnkn It's at least twice as fast, sometimes even more. On two sample 12GB files with different compression levels (one with random data, 11GB compressed; one with repeated data, 18MB compressed), `zcat | wc -c` took 60s and 42s while `gzsize` took 29s and 15s. – bfontaine Feb 04 '19 at 14:08
I searched all over the web and could not find a way to get the size when the file is bigger than 4 GB.

First, which is fastest?

[oracle@base tmp]$ time zcat oracle.20180303.030001.dmp.tar.gz | wc -c
6667028480

real    0m45.761s
user    0m43.203s
sys     0m5.185s

[oracle@base tmp]$ time gzip -dc oracle.20180303.030001.dmp.tar.gz | wc -c
6667028480

real    0m45.335s
user    0m42.781s
sys     0m5.153s

[oracle@base tmp]$ time tar -tvf oracle.20180303.030001.dmp.tar.gz
-rw-r--r-- oracle/oinstall     111828 2018-03-03 03:05 oracle.20180303.030001.log
-rw-r----- oracle/oinstall 6666911744 2018-03-03 03:05 oracle.20180303.030001.dmp

real    0m46.669s
user    0m44.347s
sys     0m4.981s

All three take about the same time, but `tar -tvf` prints the per-file sizes from the tar headers, so the question becomes: how do you cancel execution once the headers have been listed?

My solution is this:

[oracle@base tmp]$ time echo $(timeout --signal=SIGINT 1s tar -tvf oracle.20180303.030001.dmp.tar.gz | awk '{print $3}') | grep -o '[[:digit:]]*' | awk '{ sum += $1 } END { print sum }'
6667023572

real    0m1.005s
user    0m0.013s
sys     0m0.066s

- Headers? Your solution is way off, depending on the file size and number of files. Try it against numerous files inside the archive instead of 2. Try it against smaller and larger tar.gz files. – B. Shea Apr 06 '20 at 16:17
A tar file is uncompressed until/unless it is filtered through another program, such as gzip, bzip2, lzip, compress, lzma, etc. The file size of the tar file is about the same as that of the extracted files, with probably less than 1 kB of header info added to make it a valid tarball.
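In tar's classic format that header info is actually per entry: each file gets a 512-byte header and its data is padded to a multiple of 512 bytes. A sketch that estimates the uncompressed .tar size from the listing, assuming GNU tar's verbose column layout (size in column 3) and adding the two 512-byte zero blocks that end an archive (real output may be further padded to the blocking factor, 10240 bytes by default):
$ tar tzvf archive.tar.gz | awk '{ total += 512 + int(($3 + 511) / 512) * 512 } END { print total + 1024 }'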

- There's a header of 512 bytes for each file inside the tarball, plus the inner files are padded to a multiple of 512 bytes. This adds up to an average-case overhead of 768 bytes per file inside the tarball. – Sarah G Jan 09 '15 at 04:40
- The point of tarballs is that they are smaller versions for transport, just like zip files. – Nate T Dec 21 '20 at 12:13
- @Nathan No, it's not. On the contrary, it was designed around bigger data blocks than an average filesystem. TAR stands for tape archive; it is nowadays repurposed, but still an archive format for bigger data blocks. And it has nothing to do with transport; actually, back when it was designed, modems did the compression. You can gzip a TAR the same as you can gzip any other file. Tom's answer will give a rather useless size approximation, but it's the same method and the same size you get from the `gzip -l` answers, and those have 66 and 27 votes while Tom got downvotes? Not fair. – papo Jan 14 '21 at 18:03
- @papo My original comment was poorly worded, but the answer is still wrong. The size of a tar.gz file is not the same, and that is what the OP was asking about. I wrote "tarball" but meant "tar.gz file." Tom didn't really give an answer, just some info about uncompressed tarballs, which is not what the OP is asking about. That is likely the reason for the downvotes. You cannot just answer a "how do I?" question with a "you don't need to" answer; we have no idea what the OP needs unless he or she states it in the question. – Nate T Jan 15 '21 at 20:08
- @papo Seems like Tom S knew this answer might end up in the red. CYA alt account? Single-activity accounts are common for questions, but for an answer? – Nate T Jan 16 '21 at 00:44
- This may be irrelevant to the question, but I got some info I was looking for. Thanks. – Aditya Kane Jul 11 '21 at 16:53