I have a lot of .tar files in my GCP Cloud Storage Bucket. Each .tar file has multiple layers. I'd like to decompress those .tar files using GCP Dataflow and put them back into another GCP Storage Bucket.

I found the Google-provided utility template for Bulk Decompress Cloud Storage Files, but it doesn't support .tar file extensions.

Should I just decompress the files before uploading them to the cloud, or is there something else in Beam that can do this?

Each tar file is about 15 TB uncompressed.

1 Answer

This snippet borrows from the code of the Bulk Decompress Template. It also borrows from this question & answer.

As you noticed, TAR is not supported, but in general, compression/decompression in Beam seems to rely on Apache Commons' compression libraries (Commons Compress).
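
Outside of Beam, that library handles tar archives directly. Here is a minimal, standalone sketch (using a placeholder local path) of how Commons Compress iterates the entries of a tar stream; the same TarArchiveInputStream pattern is what the DoFn below applies to a GCS channel:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public class TarListing {
  public static void main(String[] args) throws IOException {
    // "archive.tar" is just a placeholder path for illustration.
    try (InputStream in = new FileInputStream("archive.tar");
         TarArchiveInputStream tarIn = new TarArchiveInputStream(in)) {
      TarArchiveEntry entry;
      while ((entry = tarIn.getNextTarEntry()) != null) {
        // The entry's bytes are read from tarIn itself until the next getNextTarEntry() call.
        System.out.println(entry.getName() + " (" + entry.getSize() + " bytes)");
      }
    }
  }
}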

You would write a pipeline that does something like this:

// Create the pipeline
Pipeline pipeline = Pipeline.create(options);

// Run the pipeline over the work items.
PCollection<String> decompressOut =
    pipeline
        .apply("MatchFile(s)",
            FileIO.match().filepattern(options.getInputFilePattern()))
        .apply(
            "DecompressFile(s)",
            ParDo.of(new Decompress(options.getOutputDirectory())));
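
For completeness, the options object used above could be backed by a small PipelineOptions interface along these lines. The getter names are only an assumption that mirrors the snippet; adapt them to whatever your pipeline actually defines:

import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.Validation;

// Hypothetical options interface matching the getters used in the snippet above.
public interface DecompressOptions extends PipelineOptions {
  @Description("File pattern of the .tar files to read, e.g. gs://my-bucket/input/*.tar")
  @Validation.Required
  String getInputFilePattern();
  void setInputFilePattern(String value);

  @Description("Directory to write the extracted files to, e.g. gs://my-bucket/output/")
  @Validation.Required
  String getOutputDirectory();
  void setOutputDirectory(String value);
}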

Where your Decompress DoFn would look something like this:

import com.google.common.io.ByteStreams;
import com.google.common.io.Files;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.util.MimeTypes;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

class Decompress extends DoFn<MatchResult.Metadata, String> {

  private final String outputDir;

  Decompress(String outputDir) {
    this.outputDir = outputDir;
  }

  @ProcessElement
  public void process(ProcessContext context) throws IOException {
    ResourceId inputFile = context.element().resourceId();
    String outputFilename = Files.getNameWithoutExtension(inputFile.toString());
    // One output sub-directory per input tar file.
    ResourceId tarOutputDir =
        FileSystems.matchNewResource(outputDir, /* isDirectory= */ true)
            .resolve(outputFilename, StandardResolveOptions.RESOLVE_DIRECTORY);

    try (TarArchiveInputStream tarInput = new TarArchiveInputStream(
        Channels.newInputStream(FileSystems.open(inputFile)))) {

      TarArchiveEntry currentEntry = tarInput.getNextTarEntry();
      while (currentEntry != null) {
        ResourceId outputFile =
            tarOutputDir.resolve(currentEntry.getName(), StandardResolveOptions.RESOLVE_FILE);
        // Copy the bytes of the current entry straight into the output file.
        try (WritableByteChannel writerChannel = FileSystems.create(outputFile, MimeTypes.TEXT)) {
          ByteStreams.copy(tarInput, Channels.newOutputStream(writerChannel));
        }
        context.output(outputFile.toString());
        currentEntry = tarInput.getNextTarEntry(); // Iterate to the next entry
      }
    }
  }
}

This is a very rough and untested code snippet, but it should get you started on the right path. LMK if we should clarify further.
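
To tie it together, a hypothetical entry point could look like the following, assuming the DecompressOptions interface sketched above (runner and project flags would be passed on the command line as usual):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.ParDo;

public class BulkUntar {
  public static void main(String[] args) {
    // Parse --inputFilePattern and --outputDirectory from the command line.
    DecompressOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DecompressOptions.class);

    Pipeline pipeline = Pipeline.create(options);
    pipeline
        .apply("MatchFile(s)", FileIO.match().filepattern(options.getInputFilePattern()))
        .apply("DecompressFile(s)", ParDo.of(new Decompress(options.getOutputDirectory())));

    pipeline.run();
  }
}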

Pablo
    +1, Dataflow Templates are not only useful for running directly, but also provide a wealth of well-tested pipelines that you can modify to suit your needs. – robertwb Aug 21 '20 at 00:28