In the Hadoop filesystem, I have two files, say X and Y. Normally, Hadoop splits files X and Y into blocks of 64 MB each. Is it possible to force Hadoop to divide the two files so that a 64 MB chunk is created out of 32 MB from X and 32 MB from Y? In other words, is it possible to override the default file partitioning behaviour?
- Might make more sense to pre-process the files before dumping them into HDFS. Doing something like what you're asking is possible, just rather ugly. – rICh Dec 02 '12 at 03:59
1 Answer
File partitioning is a function of the FileInputFormat, since splitting logically depends on the file format. You can create your own InputFormat for any format you like, so for a single file you can do it.
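If the goal is only to get 32 MB splits out of a single file, you don't even need a custom InputFormat: the maximum split size can be capped in the driver. A minimal sketch using the newer `mapreduce` API (the class and job names here are just placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallSplitDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "32mb-splits");

        // Cap each input split at 32 MB, so a 64 MB block is handed
        // to two map tasks instead of one.
        FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // ... set mapper, reducer, output format/path as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```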
Mixing parts of two different files in a single split sounds problematic, since a file is the basic unit of processing.
Why do you have such a requirement?
I see the requirement below. It has to be stated that data locality will be sacrificed, at least in part: we can run the map local to one file, but not to both.
I would suggest building some kind of "file pairs" file, putting it into the distributed cache, and then, in the map function, loading the second file from HDFS.
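A rough sketch of that idea, assuming the newer `mapreduce` API; the configuration key `pair.file.path` and the way the X and Y records are combined below are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The map task runs over a split of file X; the matching part of file Y
// is read directly from HDFS in setup(). The path of Y is passed in the
// job configuration under the (hypothetical) key "pair.file.path".
public class PairedFileMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final List<String> pairedLines = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Path pairPath = new Path(context.getConfiguration().get("pair.file.path"));
        FileSystem fs = pairPath.getFileSystem(context.getConfiguration());
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(pairPath)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                pairedLines.add(line);  // data from Y, possibly not local to this node
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Combine each record of X with the loaded data from Y however
        // the application needs; a simple pairing is shown here.
        for (String paired : pairedLines) {
            context.write(value, new Text(paired));
        }
    }
}
```

Note that every map task re-reads the paired file, so this works best when the part of Y loaded per task is reasonably small.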

– David Gruzman
- My application is such that I need parts of both files in one map task in order to process them. If I have only one file's contents in a map task, then it cannot be independently processed. – justin waugh Apr 23 '12 at 19:00