This snippet borrows from the code of the Bulk Decompress Template. It also borrows from this question & answer.
As you noticed, TAR is not supported out of the box, but in general, compression/decompression in Beam relies on the Apache Commons Compress library.
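As an aside, for the codecs Beam does support natively (e.g. GZIP, BZIP2, ZIP, DEFLATE), you don't need a custom DoFn at all; something like this works (the bucket path is just a placeholder):

import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TextIO;

// Beam decompresses natively supported codecs for you;
// Compression.AUTO would infer the codec from the file extension instead.
pipeline.apply("ReadGzippedText",
    TextIO.read()
        .from("gs://my-bucket/input/*.txt.gz") // placeholder path
        .withCompression(Compression.GZIP));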
You would write a pipeline that does something like this:
// Create the pipeline
Pipeline pipeline = Pipeline.create(options);
// Run the pipeline over the work items.
PCollection<String> decompressOut =
    pipeline
        .apply("MatchFile(s)",
            FileIO.match().filepattern(options.getInputFilePattern()))
        .apply("DecompressFile(s)",
            ParDo.of(new Decompress(options.getOutputDirectory())));
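Here, options.getInputFilePattern() and options.getOutputDirectory() are assumed to come from a custom options interface along these lines (the interface name is just illustrative):

import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.Validation;

public interface DecompressOptions extends PipelineOptions {
  @Description("Glob pattern of the TAR file(s) to read, e.g. gs://bucket/archives/*.tar")
  @Validation.Required
  String getInputFilePattern();
  void setInputFilePattern(String value);

  @Description("Directory to write the extracted files to")
  @Validation.Required
  String getOutputDirectory();
  void setOutputDirectory(String value);
}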
Where your Decompress DoFn would look something like this:
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.util.MimeTypes;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import com.google.common.io.ByteStreams;
import com.google.common.io.Files;

class Decompress extends DoFn<MatchResult.Metadata, String> {
  // Keep the output directory as a String so the DoFn stays serializable.
  private final String outputDir;
  Decompress(String outputDir) {
    this.outputDir = outputDir;
  }
  @ProcessElement
  public void process(ProcessContext context) throws IOException {
    ResourceId inputFile = context.element().resourceId();
    String outputFilename = Files.getNameWithoutExtension(inputFile.toString());
    ResourceId tempFileDir = FileSystems.matchNewResource(outputDir, true)
        .resolve(outputFilename, StandardResolveOptions.RESOLVE_DIRECTORY);
    try (TarArchiveInputStream tarInput = new TarArchiveInputStream(
        Channels.newInputStream(FileSystems.open(inputFile)))) {
      TarArchiveEntry currentEntry = tarInput.getNextTarEntry();
      while (currentEntry != null) {
        if (!currentEntry.isDirectory()) { // Skip directory entries
          ResourceId outputFile = tempFileDir.resolve(currentEntry.getName(),
              StandardResolveOptions.RESOLVE_FILE);
          // Copy the entry's bytes straight from the archive into the output file.
          try (WritableByteChannel writerChannel = FileSystems.create(outputFile, MimeTypes.TEXT)) {
            ByteStreams.copy(tarInput, Channels.newOutputStream(writerChannel));
          }
          context.output(outputFile.toString());
        }
        currentEntry = tarInput.getNextTarEntry(); // Advance to the next entry
      }
    }
  }
}
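If your archives are gzip-compressed tarballs (.tar.gz / .tgz) rather than plain .tar, you can wrap the input stream in Commons Compress's GzipCompressorInputStream before handing it to TarArchiveInputStream. A minimal sketch of just that change:

import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

// Same DoFn as above; only the stream construction changes:
try (TarArchiveInputStream tarInput = new TarArchiveInputStream(
    new GzipCompressorInputStream(
        Channels.newInputStream(FileSystems.open(inputFile))))) {
  // ... iterate over the entries exactly as before ...
}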
This is a rough, untested code snippet, but it should get you started on the right path. Let me know if anything needs further clarification.