
I have a bunch of MySQL dumps compressed with bzip2 on Google Cloud Storage, and I would like to decompress them.

I tried using a pipeline defined like this:

p.apply(TextIO
        .Read
        .from("gs://bucket/dump.sql.bz2")
        .withCompressionType(TextIO.CompressionType.BZIP2))
 .apply(TextIO
        .Write
        .to("gs://bucket/dump.sql")
        .withoutSharding());
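
For completeness, this is roughly how the whole pipeline is wired up; the class name, runner choice, and option wiring below are just placeholders for my local setup (project and staging location come from the command line), not anything specific to the problem:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner;

public class DecompressDump {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory
        .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(BlockingDataflowPipelineRunner.class);

    Pipeline p = Pipeline.create(options);

    p.apply(TextIO.Read
            .from("gs://bucket/dump.sql.bz2")
            .withCompressionType(TextIO.CompressionType.BZIP2)) // read the compressed dump line by line
     .apply(TextIO.Write
            .to("gs://bucket/dump.sql")
            .withoutSharding());                                // write a single uncompressed output file

    p.run();
  }
}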

The compressed files are around 5 GB and the uncompressed files should be around 50 GB.

The problem is that the resulting file is only around 800 kB and contains just the first few lines of the dump.

Is there something I'm doing wrong? Or is there another simple way of automating decompression of files on Google Cloud Storage?

Edit: I have found that this only happens when the files are compressed with pbzip2; when plain bzip2 is used, things are fine. It also seems that only the first block is read: when I decrease the pbzip2 block size, the size of the incomplete output file shrinks accordingly.
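
If I understand it correctly, pbzip2 writes its output as several concatenated bz2 streams (one per block), and commons-compress's BZip2CompressorInputStream only reads the first stream unless it is constructed with decompressConcatenated = true. A minimal standalone test along these lines should show the difference on a local copy of the dump (it assumes commons-compress is on the classpath; the path is just an example):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

public class Bzip2StreamTest {
  public static void main(String[] args) throws Exception {
    String path = "dump.sql.bz2"; // local copy of the pbzip2-compressed dump

    // Default constructor: stops after the first bz2 stream (i.e. the first pbzip2 block).
    try (InputStream in = new BZip2CompressorInputStream(
        new BufferedInputStream(new FileInputStream(path)))) {
      System.out.println("single stream: " + countBytes(in) + " bytes");
    }

    // decompressConcatenated = true: keeps reading the remaining streams.
    try (InputStream in = new BZip2CompressorInputStream(
        new BufferedInputStream(new FileInputStream(path)), true)) {
      System.out.println("concatenated:  " + countBytes(in) + " bytes");
    }
  }

  private static long countBytes(InputStream in) throws Exception {
    byte[] buf = new byte[8192];
    long total = 0;
    for (int n; (n = in.read(buf)) != -1; ) {
      total += n;
    }
    return total;
  }
}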

Fernet
  • What version of the SDK are you using? Could you share a job ID? – Ben Chambers Jul 13 '17 at 15:26
  • +1, this seems likely to be caused by an older SDK which uses an older version of commons-compress. There's been at least one commit in Apache Beam that upgrades commons-compress to fix a bug in it (https://issues.apache.org/jira/browse/BEAM-2373), and commons-compress used to have a bug with truncating bz2 files (https://issues.apache.org/jira/browse/COMPRESS-185). – jkff Jul 13 '17 at 20:00
  • 1.9, I'll give updating a shot! – Fernet Jul 24 '17 at 07:28
  • I have updated to 2.0.0 now. It still seems to behave in the same way. I just ran a job with jobid 2017-07-24_01_07_08-6568754560306143139 – Fernet Jul 24 '17 at 08:12

0 Answers