
I am able to write gzip-compressed data to an HDFS file. But what if I have to write this data in iterations? For example, in one iteration I compress a string using GZIPOutputStream and write that compressed output to the HDFS file. In the next iteration I want to do the same and append that iteration's compressed output to the same HDFS file. Is this possible? The final HDFS file should still be in a valid gzip format.
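
Here is a minimal sketch of the approach I have in mind, assuming HDFS append is enabled on the cluster (the class name, path, and batch data below are just placeholders). Each iteration writes one complete gzip member, and my understanding is that concatenated gzip members still form a valid gzip stream, although some readers only decompress the first member:

    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GzipAppendSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path out = new Path("/tmp/batches.gz");                 // hypothetical output path

            String[] batches = {"first batch\n", "second batch\n"}; // stand-in data
            for (String batch : batches) {
                // Create the file on the first iteration, append on later ones
                // (append support must be enabled on the cluster).
                OutputStream raw = fs.exists(out) ? fs.append(out) : fs.create(out);
                // Each iteration writes one self-contained gzip member.
                try (GZIPOutputStream gzip = new GZIPOutputStream(raw)) {
                    gzip.write(batch.getBytes(StandardCharsets.UTF_8));
                } // close() writes this member's gzip trailer and closes 'raw'
            }
            fs.close();
        }
    }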

pythonic
  • Any reason you can't just output to a separate file? You don't want large gzip volumes on hdfs anyway. – puhlen Jan 31 '17 at 15:56
  • Well, those gzipped files are given as input to a separate program. I could write them to separate files, but then the performance of the program that takes them as input suffers. – pythonic Jan 31 '17 at 15:57
  • gzip isn't a good fit for this then; to append, you need to read and unzip the existing file, append your content, then re-zip it, overwriting the original. Not to mention that reading a gzip file has to happen on a single node. Maybe just use plain text. Are you using Spark or some other Hadoop processing engine? They will be able to handle multiple files. If you aren't using a distributed processing engine, why are you using HDFS? – puhlen Jan 31 '17 at 16:02
  • Yes, I am using Spark with HDFS, and I'm fine with reading one gzip file per task. Maybe I'll just upload uncompressed data, as HDFS can normally handle a lot of data anyway. – pythonic Jan 31 '17 at 16:03
  • You can point Spark at a directory and it will read all files in the directory. This is preferable to having one large gzip file because you can process the files in parallel instead of needing to load the entire file onto a single node for decompression. – puhlen Jan 31 '17 at 18:36
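
A minimal sketch of the directory approach suggested in the last comment, assuming Spark decompresses the gzip files transparently (the directory path and app name are made up for the example):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ReadGzipDirectorySketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("read-gzip-directory")                // hypothetical app name
                    .getOrCreate();

            // Pointing the reader at a directory picks up every file inside it;
            // .gz files are decompressed transparently, and each gzip file ends
            // up in its own partition, so many small files are processed in
            // parallel while a single large gzip file cannot be split.
            Dataset<Row> lines = spark.read().text("hdfs:///data/gzipped-dir"); // hypothetical path

            System.out.println("total lines: " + lines.count());
            spark.stop();
        }
    }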

0 Answers