
How do I merge all the files in a directory on HDFS, which I know are all compressed, into a single compressed file, without copying the data through the local machine? For example (but not necessarily) using Pig?

As an example, I have a folder /data/input that contains the files part-m-00000.gz and part-m-00001.gz. Now I want to merge them into a single file, /data/output/foo.gz.

matthiash

3 Answers


I would suggest looking at FileCrush (https://github.com/edwardcapriolo/filecrush), a tool that merges files on HDFS using MapReduce. It does exactly what you described and provides several options for dealing with compression and controlling the number of output files.

  Crush --max-file-blocks XXX /data/input /data/output

max-file-blocks represents the maximum number of dfs blocks per output file. For example, according to the documentation:

With the default value of 8, 80 small files, each being 1/10th of a dfs block, will be grouped into a single output file, since 80 * 1/10 = 8 dfs blocks. If there are 81 small files, each being 1/10th of a dfs block, two output files will be created. One output file will contain the combined contents of 41 files and the second will contain the combined contents of the other 40. A directory of many small files will be converted into a smaller number of larger files, where each output file is roughly the same size.
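
FileCrush runs as an ordinary MapReduce job, so the Crush command above is normally launched through "hadoop jar". A rough sketch of such an invocation follows; the jar name, main class, and the compression option are assumptions based on the FileCrush README and may differ for the version you build, so check the project page before relying on them:

  # Sketch only: the jar name, main class and the --compress option are assumptions
  # taken from the FileCrush README; --max-file-blocks is the option described above.
  hadoop jar filecrush-2.2.2-SNAPSHOT.jar com.m6d.filecrush.crush.Crush \
      --compress org.apache.hadoop.io.compress.GzipCodec \
      --max-file-blocks 8 \
      /data/input /data/output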

Jerome Serrano

If you set PARALLEL to 1, then you will have a single output file. This can be done in two ways:

  1. In your Pig script, add set default_parallel 1; but note that this affects everything in the script.
  2. Change the parallelism for a single operation, e.g. DISTINCT id PARALLEL 1;

You can read more about Pig's parallel features in the documentation.
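
Note that PARALLEL and default_parallel only control the number of reduce tasks; a plain LOAD/STORE is map-only and keeps one output file per mapper, so the script needs at least one reducing operator (ORDER BY, GROUP, DISTINCT, ...) for the single-reducer setting to take effect. Below is a minimal sketch of the idea, assuming PigStorage and the output.compression properties described in the Pig documentation; the paths come from the question, the sort key is only illustrative, and property names can vary between Pig and Hadoop versions:

  -- compress the final output with gzip
  SET output.compression.enabled true;
  SET output.compression.codec 'org.apache.hadoop.io.compress.GzipCodec';

  raw    = LOAD '/data/input' USING PigStorage();
  -- ORDER BY forces a reduce phase; PARALLEL 1 means a single reducer,
  -- so a single compressed part file is written under /data/output
  sorted = ORDER raw BY $0 PARALLEL 1;
  STORE sorted INTO '/data/output' USING PigStorage();

The result is a file such as /data/output/part-r-00000.gz rather than a file literally named foo.gz, so a final hdfs dfs -mv may be needed to rename it.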

Mzf

I know there is an option to merge to the local filesystem using the "hdfs dfs -getmerge" command. Perhaps you can use that to merge to the local filesystem and then use the "hdfs dfs -copyFromLocal" command to copy it back into HDFS.
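
A short sketch of that sequence, using the paths from the question. It does route the data through the local machine, which the question wanted to avoid, but it happens to work well for .gz input because concatenated gzip members still form a valid gzip stream:

  # pull all parts of /data/input down into a single local file
  hdfs dfs -getmerge /data/input /tmp/foo.gz
  # push the merged file back into HDFS and clean up locally
  hdfs dfs -mkdir -p /data/output
  hdfs dfs -copyFromLocal /tmp/foo.gz /data/output/foo.gz
  rm /tmp/foo.gz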

Anil