0

In the Hadoop filesystem, I have two files, say X and Y. Normally, Hadoop splits files X and Y into 64 MB chunks. Is it possible to force Hadoop to divide the two files such that a 64 MB chunk is created out of 32 MB from X and 32 MB from Y? In other words, is it possible to override the default file-partitioning behaviour?

justin waugh
  • 885
  • 3
  • 12
  • 22
  • Might make more sense to pre-process the files before dumping them into HDFS. Doing something like what you're asking is possible, just rather ugly. – rICh Dec 02 '12 at 03:59

1 Answer

0

File partitioning is a function of the InputFormat (typically FileInputFormat), since splitting logically depends on the file format. You can create your own InputFormat that splits the input however you like, so for a single file this is possible.
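For the per-file case, here is a minimal sketch of capping the split size at 32 MB without even writing a custom InputFormat. It assumes the newer org.apache.hadoop.mapreduce API (older releases expose the same knob as the mapred.max.split.size property and construct the Job directly); the identity Mapper and a map-only job are used only to keep the example complete.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a driver that forces 32 MB input splits instead of the
// default block-sized splits.
public class SmallSplitsJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "small-splits");
        job.setJarByClass(SmallSplitsJob.class);
        job.setNumReduceTasks(0); // map-only: one output file per split

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Cap every split at 32 MB regardless of the HDFS block size.
        FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```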
Mixing parts of two different files in a single split sounds problematic, since a file is the basic unit of processing.
Why do you have such a requirement? I see the requirement in the comments below. It should be noted that data locality has to be sacrificed, at least in part: we can run the map task local to one file, but not to both.
I would suggest building some kind of "file pairs" file, putting it into the distributed cache, and then, in the map function, loading the second file from HDFS.
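A hedged sketch of that approach. It assumes a small tab-separated pairs.txt (each line "pathToX<TAB>pathToY") has been shipped to every task via the distributed cache, e.g. job.addCacheFile(new URI("/meta/pairs.txt#pairs.txt")) on newer releases or DistributedCache.addCacheFile(...) on older ones; the combination logic in map() is only a placeholder for your application's processing.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch of the "file pairs" idea: the mapper figures out which file its
// split came from, looks up the partner file in pairs.txt (shipped via the
// distributed cache), and opens that partner directly from HDFS.
public class PairedFileMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> pairs = new HashMap<String, String>();
    private BufferedReader partnerReader; // reads the partner file from HDFS

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();

        // pairs.txt is symlinked into the task's working directory by the
        // distributed cache; each line is "pathToFirstFile\tpathToPartnerFile".
        BufferedReader cached = new BufferedReader(new FileReader("pairs.txt"));
        try {
            String line;
            while ((line = cached.readLine()) != null) {
                String[] parts = line.split("\t");
                pairs.put(parts[0], parts[1]);
            }
        } finally {
            cached.close();
        }

        // Which file does this split belong to? Open its registered partner.
        FileSplit split = (FileSplit) context.getInputSplit();
        String partner = pairs.get(split.getPath().toString());
        if (partner != null) {
            Path partnerPath = new Path(partner);
            partnerReader = new BufferedReader(new InputStreamReader(
                    partnerPath.getFileSystem(conf).open(partnerPath)));
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (partnerReader == null) {
            return; // no partner registered for this input file
        }
        // Placeholder: pair each record of this file with the next record of
        // the partner file; replace with the real combination logic.
        String partnerLine = partnerReader.readLine();
        context.write(value, new Text(partnerLine == null ? "" : partnerLine));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        if (partnerReader != null) {
            partnerReader.close();
        }
    }
}
```

Note that the partner file is read remotely, which is exactly the locality cost mentioned above.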

David Gruzman
  • 7,900
  • 1
  • 28
  • 30
  • My application is such that I need parts of both files in one map task in order to process them. If I have only one file's contents in a map task, then it cannot be independently processed. – justin waugh Apr 23 '12 at 19:00
  • I was doing what you have suggested. Thanks. – justin waugh Apr 25 '12 at 03:29