3

In an HDFS cluster, I receive multiple files on a daily basis, which can be of 3 types:

1) product_info_timestamp

2) user_info_timestamp

3) user_activity_timestamp

Any number of files can be received, but they will all belong to one of these 3 categories.

I want to merge all the files (after checking whether they are less than 100 MB) belonging to one category into a single file. For example, 3 files named product_info_* should be merged into one file named product_info.

How do I achieve this?
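A rough sketch of the size check I have in mind, using hdfs dfs -du (the directory name is a placeholder):

# List each file under the ingest directory with its size in bytes and
# keep only those smaller than 100 MB.
hdfs dfs -du /hdfs_path | awk '$1 < 100*1024*1024 {print $NF}'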

kritikaTalwar
user3829376
  • Stack Overflow is a site for programming and development questions. This question appears to be off-topic because it is not about programming or development. See [What topics can I ask about here](http://stackoverflow.com/help/on-topic) in the Help Center. Perhaps [Super User](http://superuser.com/) or [Unix & Linux Stack Exchange](http://unix.stackexchange.com/) would be a better place to ask. – jww May 02 '18 at 04:39

3 Answers

4

You can use getmerge to achieve this, but the result will be stored on your local node (edge node), so you need to be sure you have enough space there.

hadoop fs -getmerge /hdfs_path/product_info_* /local_path/product_info

You can move it back to HDFS with put:

hadoop fs -put /local_path/product_info /hdfs_path
SCouto
  • Thank you .. will try this :) – user3829376 Apr 30 '18 at 08:43
  • Can we do this in MapReduce and feed it to Oozie so that it runs on a daily basis automatically? – user3829376 May 09 '18 at 04:21
  • Dunno how to do that in MapReduce, but you can run a script in oozie. That script can perform that action or any other you need (see the sketch after these comments). If the answer helps, please feel free to upvote or accept it so other users with the same questions can find it easily – SCouto May 09 '18 at 06:14
  • I just tried it and it seems an easy solution. I am wondering what the big deal is about the small files problem, then? Is it legacy, in that getmerge is a newer command? @SCouto – thebluephantom Sep 25 '18 at 09:47
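A rough sketch of such a script, which could be called from an Oozie shell action or cron (category names come from the question, paths are placeholders):

# merge_daily.sh - sketch only: merge each category's files and push the result back to HDFS
for category in product_info user_info user_activity; do
  hadoop fs -getmerge /hdfs_path/${category}_* /tmp/${category}
  hadoop fs -put -f /tmp/${category} /hdfs_path/${category}
  rm -f /tmp/${category}
done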
1

You can use a Hadoop archive (.har file) or a sequence file. It is very simple to use - just google "hadoop archive" or "sequence file".
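For example, a Hadoop archive of the ingest directory might be built like this (the archive name and paths are placeholders, not from the question):

# Archive everything under /hdfs_path into product_info.har in /hdfs_archive_path
hadoop archive -archiveName product_info.har -p /hdfs_path /hdfs_archive_path

# The archived files can then be listed through the har:// scheme:
hdfs dfs -ls har:///hdfs_archive_path/product_info.har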

alex-arkhipov
1

Another set of commands, along similar lines to those suggested by @SCouto:

hdfs dfs -cat /hdfs_path/product_info_* > /local_path/product_info_combined.txt

hdfs dfs -put /local_path/product_info_combined.txt /hdfs_path/
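If the local copy is a concern, the same idea can be piped straight back into HDFS, since -put accepts "-" (stdin) as its source; a sketch using the same placeholder paths:

# Concatenate the category's files and write the result directly to HDFS, no local file needed
hdfs dfs -cat /hdfs_path/product_info_* | hdfs dfs -put - /hdfs_path/product_info_combined.txt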

Thomas
  • Try this with 1 TB of daily data and it will take hours to complete. It would be nice to have a pure HDFS solution that doesn't copy data to the local node. – ulkas Jan 08 '19 at 06:51