I have a bunch of MySQL dumps compressed with bzip2 on Google Cloud Storage, and I would like to uncompress them.
I tried using a pipeline defined like this:
p.apply(TextIO.Read
    .from("gs://bucket/dump.sql.bz2")
    .withCompressionType(TextIO.CompressionType.BZIP2))
 .apply(TextIO.Write
    .to("gs://bucket/dump.sql")
    .withoutSharding());
The compressed files are around 5 GB each, and the uncompressed files should be around 50 GB.
The problem is that the resulting file is only around 800 kB and consists of just the first chunk of lines from the dump.
Is there something I'm doing wrong? Or is there another simple way of automating the decompression of files on Google Cloud Storage?
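For context, the manual alternative I'm trying to avoid scripting by hand would look roughly like the sketch below, streaming through the google-cloud-storage client and Apache Commons Compress (the bucket and object names are placeholders, and I haven't tried this at the 50 GB scale):

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.Channels;

import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

public class GcsBunzip2 {
    public static void main(String[] args) throws Exception {
        Storage storage = StorageOptions.getDefaultInstance().getService();

        // Placeholder bucket/object names.
        BlobId source = BlobId.of("bucket", "dump.sql.bz2");
        BlobInfo target = BlobInfo.newBuilder(BlobId.of("bucket", "dump.sql")).build();

        // Read the compressed object, decompress on the fly, and stream
        // the result back to GCS without materializing it locally.
        try (InputStream in = new BZip2CompressorInputStream(
                 Channels.newInputStream(storage.reader(source)),
                 /* decompressConcatenated= */ true);
             OutputStream out = Channels.newOutputStream(storage.writer(target))) {
            byte[] buf = new byte[64 * 1024];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
        }
    }
}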
Edit: I have found that this only happens when the files are compressed with pbzip2; when plain bzip2 is used, everything works fine. It also seems that only the first compressed block is read: when I decrease pbzip2's block size, the size of the incomplete output file shrinks accordingly.
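For what it's worth, pbzip2 writes its output as many small concatenated bzip2 streams (one per block), so a reader that stops after the first stream would produce exactly this kind of truncation. Here is a small sketch I used to check that hypothesis locally, assuming Apache Commons Compress is available (the file name is a placeholder):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

public class PbzipStreamCheck {
    public static void main(String[] args) throws Exception {
        // Count decompressed bytes with concatenated-stream support off vs. on.
        System.out.println("first stream only: " + countBytes(false));
        System.out.println("all streams:       " + countBytes(true));
    }

    private static long countBytes(boolean decompressConcatenated) throws Exception {
        // Second constructor argument: true = keep reading across
        // concatenated bzip2 streams (as produced by pbzip2);
        // false = stop after the first stream.
        try (InputStream in = new BZip2CompressorInputStream(
                new BufferedInputStream(new FileInputStream("dump.sql.bz2")),
                decompressConcatenated)) {
            byte[] buf = new byte[8192];
            long total = 0;
            for (int n; (n = in.read(buf)) != -1; ) {
                total += n;
            }
            return total;
        }
    }
}

If the first count matches the ~800 kB output and the second matches the full dump, that would confirm only the first stream is being consumed.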