
How do I merge all the files in a directory on HDFS, which I know are all compressed, into a single compressed file, without copying the data through the local machine? For example (but not necessarily) using Pig?

As an example, I have a folder /data/input that contains the files part-m-00000.gz and part-m-00001.gz. Now I want to merge them into a single file, /data/output/foo.gz.

matthiash

3 Answers


I would suggest looking at FileCrush (https://github.com/edwardcapriolo/filecrush), a tool that merges files on HDFS using MapReduce. It does exactly what you described and provides several options for dealing with compression and controlling the number of output files.

  Crush --max-file-blocks XXX /data/input /data/output

max-file-blocks represents the maximum number of dfs blocks per output file. For example, according to the documentation:

With the default value of 8, 80 small files, each being 1/10th of a dfs block, will be grouped into a single output file, since 80 * 1/10 = 8 dfs blocks. If there are 81 small files, each being 1/10th of a dfs block, two output files will be created. One output file will contain the combined contents of 41 files and the second will contain the combined contents of the other 40. A directory of many small files will be converted into a smaller number of larger files, where each output file is roughly the same size.
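
FileCrush runs as an ordinary MapReduce job, so the Crush command above is normally launched through "hadoop jar". A rough sketch of such an invocation follows; the jar name, main class, and the compression option are assumptions based on the FileCrush README and may differ for the version you build, so check the project page before relying on them:

  # Sketch only: the jar name, main class and the --compress option are assumptions
  # taken from the FileCrush README; --max-file-blocks is the option described above.
  hadoop jar filecrush-2.2.2-SNAPSHOT.jar com.m6d.filecrush.crush.Crush \
      --compress org.apache.hadoop.io.compress.GzipCodec \
      --max-file-blocks 8 \
      /data/input /data/output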

Jerome Serrano

If you set PARALLEL to 1, then you will have a single output file. This can be done in two ways:

  1. In your Pig script, add set default_parallel 1; but note that this affects everything in the script.
  2. Change the parallelism for a single operation, e.g. DISTINCT id PARALLEL 1;

You can read more about Pig's parallel features in the documentation.
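
Note that PARALLEL and default_parallel only control the number of reduce tasks; a plain LOAD/STORE is map-only and keeps one output file per mapper, so the script needs at least one reducing operator (ORDER BY, GROUP, DISTINCT, ...) for the single-reducer setting to take effect. Below is a minimal sketch of the idea, assuming PigStorage and the output.compression properties described in the Pig documentation; the paths come from the question, the sort key is only illustrative, and property names can vary between Pig and Hadoop versions:

  -- compress the final output with gzip
  SET output.compression.enabled true;
  SET output.compression.codec 'org.apache.hadoop.io.compress.GzipCodec';

  raw    = LOAD '/data/input' USING PigStorage();
  -- ORDER BY forces a reduce phase; PARALLEL 1 means a single reducer,
  -- so a single compressed part file is written under /data/output
  sorted = ORDER raw BY $0 PARALLEL 1;
  STORE sorted INTO '/data/output' USING PigStorage();

The result is a file such as /data/output/part-r-00000.gz rather than a file literally named foo.gz, so a final hdfs dfs -mv may be needed to rename it.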

Mzf

I know there is an option to merge to the local filesystem using the "hdfs dfs -getmerge" command. Perhaps you can use that to merge to the local filesystem and then use the "hdfs dfs -copyFromLocal" command to copy it back into HDFS.
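
A short sketch of that sequence, using the paths from the question. It does route the data through the local machine, which the question wanted to avoid, but it happens to work well for .gz input because concatenated gzip members still form a valid gzip stream:

  # pull all parts of /data/input down into a single local file
  hdfs dfs -getmerge /data/input /tmp/foo.gz
  # push the merged file back into HDFS and clean up locally
  hdfs dfs -mkdir -p /data/output
  hdfs dfs -copyFromLocal /tmp/foo.gz /data/output/foo.gz
  rm /tmp/foo.gz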

Anil