
I have basically the same question as this, but rather than using awk I'd like to use Python, assuming it isn't substantially slower than some other method. I was thinking of reading line by line and compressing on the fly, but then I came across this post, which suggests that would be a bad idea (the compression ends up not very efficient). I also came across Python's built-in gzip library, which looks nice, so I'm hoping there is a clean, fast, and efficient Pythonic way to do this.

I want to go from this:

gzcat file1.gz
# header
1
2

to this:

# header
1
2
1
2
1
2
1
2

I have a few hundred files, and the total uncompressed is about 60 GB. The files are gzipped CSV files.

user554481
Probably it's going to be way too slow in Python. I suggest trying the `subprocess` module to call `gzcat` from within your Python code. A simple example of that (with zcat): https://codebright.wordpress.com/2011/03/25/139/ Profiling and related discussion: http://www.dalkescientific.com/writings/diary/archive/2020/09/16/faster_gzip_reading_in_python.html – mechanical_meat Jan 28 '22 at 19:55
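For reference, the subprocess idea from that comment might look roughly like this; the file list is a placeholder, `zcat` is assumed to be on the PATH, and this sketch is untested:

```python
import gzip
import subprocess

# Hypothetical list of input files; replace with your actual names.
files = ["file1.gz", "file2.gz"]

# Read decompressed lines via zcat (assumed available), write a re-compressed combined file.
with gzip.open("all.gz", "wb") as out:
    for i, name in enumerate(files):
        proc = subprocess.Popen(["zcat", name], stdout=subprocess.PIPE)
        for j, line in enumerate(proc.stdout):
            # Keep the header line only from the first file.
            if j == 0 and i > 0:
                continue
            out.write(line)
        proc.wait()
```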

1 Answer


Since you need to remove the first line of each CSV file, you have no choice but to decompress all of the data and recompress it.

You can open a gzip output file for the result using `with gzip.open('all.gz', 'wb') as g:`.

You can open each input file using `with gzip.open('filen.gz', 'rb') as f:`, and then call `x = f.readline()` on that object to read the uncompressed data one line at a time. That lets you, for example, discard the header line from every file except the first.

For the lines you want to keep, you can write them to the output with `g.write(x)`.
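Putting those pieces together, a minimal sketch might look like this; the list of input file names is an assumption, so substitute your own few hundred files:

```python
import gzip

# Hypothetical list of input files; replace with your own names.
files = ['file1.gz', 'file2.gz', 'file3.gz']

with gzip.open('all.gz', 'wb') as g:
    for i, name in enumerate(files):
        with gzip.open(name, 'rb') as f:
            header = f.readline()
            # Keep the header only from the first file.
            if i == 0:
                g.write(header)
            # Copy the remaining lines unchanged.
            for x in f:
                g.write(x)
```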

Mark Adler