
I have basically the same question as this, but rather than using awk I'd like to use Python, assuming it isn't substantially slower than some other method. I was thinking of reading line by line and compressing on the fly, but then I came across this post, which suggests that would be a bad idea (the compression ends up not very efficient). I also came across Python's built-in gzip library, which looks nice, so I'm hoping there is a clean, fast, and efficient Pythonic way to do this.

I want to go from this:

gzcat file1.gz
# header
1
2

to this:

# header
1
2
1
2
1
2
1
2

I have a few hundred files, and the total uncompressed is about 60 GB. The files are gzipped CSV files.

user554481
Probably it's going to be way too slow in Python. I suggest trying the `subprocess` module to call `gzcat` from within your Python code. A simple example of that (with zcat): https://codebright.wordpress.com/2011/03/25/139/ Profiling and related discussion: http://www.dalkescientific.com/writings/diary/archive/2020/09/16/faster_gzip_reading_in_python.html – mechanical_meat Jan 28 '22 at 19:55
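For reference, the subprocess idea from that comment might look roughly like this; the file list is a placeholder, `zcat` is assumed to be on the PATH, and this sketch is untested:

```python
import gzip
import subprocess

# Hypothetical list of input files; replace with your actual names.
files = ["file1.gz", "file2.gz"]

# Read decompressed lines via zcat (assumed available), write a re-compressed combined file.
with gzip.open("all.gz", "wb") as out:
    for i, name in enumerate(files):
        proc = subprocess.Popen(["zcat", name], stdout=subprocess.PIPE)
        for j, line in enumerate(proc.stdout):
            # Keep the header line only from the first file.
            if j == 0 and i > 0:
                continue
            out.write(line)
        proc.wait()
```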

1 Answer


Since you need to remove the first line of each CSV file, you have no choice but to decompress all of the data and recompress it.

You can open a gzip output file for the result using `with gzip.open('all.gz', 'wb') as g:`.

You can open each input file using `with gzip.open('filen.gz', 'rb') as f:`, and then call `x = f.readline()` on that object to read the uncompressed data one line at a time. That lets you, for example, discard the header line from every file except the first.

For the lines you want to keep, you can write them to the output with `g.write(x)`.
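Putting those pieces together, a minimal sketch might look like this; the list of input file names is an assumption, so substitute your own few hundred files:

```python
import gzip

# Hypothetical list of input files; replace with your own names.
files = ['file1.gz', 'file2.gz', 'file3.gz']

with gzip.open('all.gz', 'wb') as g:
    for i, name in enumerate(files):
        with gzip.open(name, 'rb') as f:
            header = f.readline()
            # Keep the header only from the first file.
            if i == 0:
                g.write(header)
            # Copy the remaining lines unchanged.
            for x in f:
                g.write(x)
```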

Mark Adler