
I'm looking for a way to zip a (big) file stored in a Google Cloud Storage bucket and write the compressed file back to a bucket as well.

This command sequence is fast and works fine:

gsutil cat gs://bucket/20190515.csv | zip | gsutil cp - gs://bucket/20190515.csv.zip

...but it has the problem that the file inside the ZIP gets the useless name "-".
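For instance, copying the archive back locally and listing it (object names as in the question) shows the single entry stored as "-":

gsutil cp gs://bucket/20190515.csv.zip .
unzip -l 20190515.csv.zip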

On the other hand, if I use the sequence:

gsutil cp gs://bucket/20190515.csv .
zip -m 20190515.csv.zip 20190515.csv
gsutil mv 20190515.csv.zip gs://bucket/20190515.csv.zip

...then I get a usable name inside the ZIP, but the commands take extremely long and need a correspondingly large (virtual) hard disk.

  • Does it need to be a DOS-style zip file? `gzip` and `tar` handle this better, as they are native Unix formats designed for use cases like piping (a sketch follows these comments). – that other guy May 16 '19 at 17:39
  • Can't the `zip` portion of your pipeline create the file directly? (Don't know squat about `gsutil` ;-) ). Good luck. – shellter May 16 '19 at 17:39
  • Since you have added the `python` tag, check out [this](https://stackoverflow.com/a/55516204/5008284) streaming Python solution to compress stdin to stdout as a zip file that holds the data as a file named, in the example code, `TheLogFile`. – meuh May 16 '19 at 18:32
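For illustration, the gzip route from the first comment streams end to end with no temporary disk use; because gzip compresses stdin here, no internal filename is stored at all, and gunzip simply names its output after the .gz file (object names taken from the question):

gsutil cat gs://bucket/20190515.csv | gzip | gsutil cp - gs://bucket/20190515.csv.gz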

1 Answer


Thanks to meuh's advice, I now have a solution:

#!/usr/bin/python3
# Stream-zip: read data on stdin, write a ZIP archive to stdout;
# the first command-line argument becomes the filename inside the archive.
import sys, zipstream
with zipstream.ZipFile(mode='w', compression=zipstream.ZIP_DEFLATED) as z:
    # Register stdin as the (lazily consumed) content of the archive entry.
    z.write_iter(sys.argv[1], sys.stdin.buffer)
    # Iterating the ZipFile yields the compressed archive chunk by chunk.
    for chunk in z:
        sys.stdout.buffer.write(chunk)
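This relies on the zipstream package used above (python-zipstream on PyPI), which provides the streaming ZipFile:

pip install zipstream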

With this stored as streamzip.py, the following call:

fn="bucket/20190515.csv"
execCmd("gsutil cat gs://%s | streamzip.py %s | gsutil cp - gs://%s.zip"%(fn, fn.split("/")[-1], fn))

...gives the desired result.
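Note that the pipeline as written assumes streamzip.py is executable and on the PATH; if it is not (an assumption about the environment), the same pipeline can be spelled out explicitly:

gsutil cat gs://bucket/20190515.csv | python3 streamzip.py 20190515.csv | gsutil cp - gs://bucket/20190515.csv.zip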
