How would you go about organizing a process of zipping objects that reside in object storage?

For context, our users sometimes request an extraction of their entire data from the app - think of Twitter's "Download your archive" feature.

Our users are able to upload files, so the extracted data must contain files stored in object storage (Google Cloud Storage). The requested data must be packed into a single .zip archive.

A naive approach would look like this:

  1. download all files from object storage to disk,
  2. zip all files into an archive,
  3. put the .zip back into object storage,
  4. send the user a link to download the .zip file.
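
A minimal sketch of this naive flow in Python, assuming the google-cloud-storage client library; the bucket names, the `user-data/{user_id}/` prefix, and the signed-URL lifetime are placeholders:

```python
import os
import zipfile
from datetime import timedelta

from google.cloud import storage


def export_user_files(user_id: str, src_bucket: str, dst_bucket: str) -> str:
    client = storage.Client()
    archive_path = f"/tmp/{user_id}.zip"

    # Steps 1-2: download every object under the user's prefix to local
    # disk, adding each one to the zip and deleting it as we go.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for blob in client.list_blobs(src_bucket, prefix=f"user-data/{user_id}/"):
            local_path = f"/tmp/{os.path.basename(blob.name)}"
            blob.download_to_filename(local_path)
            archive.write(local_path, arcname=blob.name)
            os.remove(local_path)

    # Step 3: put the finished .zip back into object storage.
    zip_blob = client.bucket(dst_bucket).blob(f"exports/{user_id}.zip")
    zip_blob.upload_from_filename(archive_path)

    # Step 4: return a time-limited signed URL for the user to download.
    return zip_blob.generate_signed_url(version="v4", expiration=timedelta(days=7))
```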

However, there are multiple disadvantages here:

  1. the files for even a single user sometimes add up to gigabytes,
  2. if the process of zipping is interrupted, it has to start over.

What's a reasonable way to design a process for generating a .zip archive of user files that originally reside in object storage?

oldhomemovie
  • You may use the cloud shell to keep the data transfer within the GCP as suggested here: https://stackoverflow.com/a/64606874/2777988 – Rakesh Gupta Oct 31 '22 at 22:15
  • You can't avoid downloading the content first for processing. There is no in-place zip (or any kind of processing) offered by Cloud Storage. The only thing you can do is choose a machine where the bandwidth to your bucket is maximized. This is your actual question - not so much designing the process but maximizing the speed to your bucket. – Doug Stevenson Nov 01 '22 at 00:04
  • I don’t know if the zip is a mandatory requirement for your case. However, if you can change it to tar.gz for example, you may not need to read the whole set of files into memory before compression. That opens an opportunity to use a Cloud Function or Cloud Run. Obviously there will be drawbacks as well. – al-dann Nov 01 '22 at 00:46
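
To make that tar.gz suggestion concrete, here is a hedged sketch that streams each object through a tar.gz written directly back to Cloud Storage, so memory and disk stay bounded regardless of total size. Bucket names and prefixes are placeholders, and `blob.open()` assumes google-cloud-storage >= 1.38:

```python
import tarfile

from google.cloud import storage


def stream_user_archive(user_id: str, src_bucket: str, dst_bucket: str) -> None:
    client = storage.Client()
    dst_blob = client.bucket(dst_bucket).blob(f"exports/{user_id}.tar.gz")

    # Write the archive straight to GCS; "w|gz" keeps tarfile in
    # streaming mode, so nothing is buffered beyond one chunk.
    with dst_blob.open("wb") as out, tarfile.open(fileobj=out, mode="w|gz") as tar:
        for blob in client.list_blobs(src_bucket, prefix=f"user-data/{user_id}/"):
            info = tarfile.TarInfo(name=blob.name)
            info.size = blob.size  # size is already known from the listing
            with blob.open("rb") as src:
                tar.addfile(info, fileobj=src)
```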

1 Answer

Unfortunately, your naive approach is the only way, because Cloud Storage offers no compute capabilities. Archiving files requires compute, memory, and temporary storage.

The key item is to choose a service, such as Compute Engine, that can meet your file processing requirements: multi-gigabyte files, fast processing (compression), and high-speed networking.

Another issue is the time it takes to download, zip, and upload. That calls for an asynchronous, event-based design: start the file processing, then notify the user (email, message, web inbox, etc.) once it is complete.
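
One possible shape for that asynchronous design is a Pub/Sub queue between the web tier and a worker; this is just a sketch, where the topic name, the `notify_user` helper, and the reuse of the `export_user_files` function from the question's sketch are all assumptions for illustration:

```python
import json

from google.cloud import pubsub_v1

PROJECT_ID = "my-project"   # placeholder
TOPIC = "export-requests"   # hypothetical topic name


def request_export(user_id: str) -> None:
    """Web tier: enqueue the export job and return immediately."""
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, TOPIC)
    publisher.publish(topic_path, json.dumps({"user_id": user_id}).encode("utf-8"))


def on_export_request(message: pubsub_v1.subscriber.message.Message) -> None:
    """Worker (e.g. on Compute Engine): run the zip pipeline, then notify."""
    job = json.loads(message.data)
    # export_user_files is the download/zip/upload sketch from the question;
    # notify_user is a hypothetical email/in-app notifier.
    url = export_user_files(job["user_id"], "src-bucket", "dst-bucket")
    notify_user(job["user_id"], url)
    message.ack()

# Worker startup, e.g.:
#   subscriber = pubsub_v1.SubscriberClient()
#   sub_path = subscriber.subscription_path(PROJECT_ID, "export-requests-sub")
#   subscriber.subscribe(sub_path, callback=on_export_request).result()
```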

You could make the process synchronous and display a progress bar, but that will complicate the design.

John Hanley