0

In the Hadoop filesystem, I have two files, say X and Y. Normally, Hadoop splits files X and Y into 64 MB chunks. Is it possible to force Hadoop to divide the two files such that a 64 MB chunk is created out of 32 MB from X and 32 MB from Y? In other words, is it possible to override the default file-partitioning behaviour?

justin waugh
  • 885
  • 3
  • 12
  • 22
  • Might make more sense to pre-process the files before dumping them into HDFS. Doing something like what you're asking is possible, just rather ugly. – rICh Dec 02 '12 at 03:59

1 Answer

0

File partitioning is a function of the InputFormat (typically FileInputFormat), since splitting logically depends on the file format. You can create your own InputFormat that splits the input however you like, so for a single file this is possible.
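For the per-file case, here is a minimal sketch of capping the split size at 32 MB without even writing a custom InputFormat. It assumes the newer org.apache.hadoop.mapreduce API (older releases expose the same knob as the mapred.max.split.size property and construct the Job directly); the identity Mapper and a map-only job are used only to keep the example complete.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a driver that forces 32 MB input splits instead of the
// default block-sized splits.
public class SmallSplitsJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "small-splits");
        job.setJarByClass(SmallSplitsJob.class);
        job.setNumReduceTasks(0); // map-only: one output file per split

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Cap every split at 32 MB regardless of the HDFS block size.
        FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```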
Mixing parts of two different files in a single split sounds problematic, since a file is the basic unit of processing.
Why do you have such a requirement? I see the requirement in the comments below. It should be noted that data locality has to be sacrificed, at least in part: we can run the map task local to one file, but not to both.
I would suggest building some kind of "file pairs" file, putting it into the distributed cache, and then, in the map function, loading the second file from HDFS.
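A hedged sketch of that approach. It assumes a small tab-separated pairs.txt (each line "pathToX<TAB>pathToY") has been shipped to every task via the distributed cache, e.g. job.addCacheFile(new URI("/meta/pairs.txt#pairs.txt")) on newer releases or DistributedCache.addCacheFile(...) on older ones; the combination logic in map() is only a placeholder for your application's processing.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch of the "file pairs" idea: the mapper figures out which file its
// split came from, looks up the partner file in pairs.txt (shipped via the
// distributed cache), and opens that partner directly from HDFS.
public class PairedFileMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> pairs = new HashMap<String, String>();
    private BufferedReader partnerReader; // reads the partner file from HDFS

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();

        // pairs.txt is symlinked into the task's working directory by the
        // distributed cache; each line is "pathToFirstFile\tpathToPartnerFile".
        BufferedReader cached = new BufferedReader(new FileReader("pairs.txt"));
        try {
            String line;
            while ((line = cached.readLine()) != null) {
                String[] parts = line.split("\t");
                pairs.put(parts[0], parts[1]);
            }
        } finally {
            cached.close();
        }

        // Which file does this split belong to? Open its registered partner.
        FileSplit split = (FileSplit) context.getInputSplit();
        String partner = pairs.get(split.getPath().toString());
        if (partner != null) {
            Path partnerPath = new Path(partner);
            partnerReader = new BufferedReader(new InputStreamReader(
                    partnerPath.getFileSystem(conf).open(partnerPath)));
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (partnerReader == null) {
            return; // no partner registered for this input file
        }
        // Placeholder: pair each record of this file with the next record of
        // the partner file; replace with the real combination logic.
        String partnerLine = partnerReader.readLine();
        context.write(value, new Text(partnerLine == null ? "" : partnerLine));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        if (partnerReader != null) {
            partnerReader.close();
        }
    }
}
```

Note that the partner file is read remotely, which is exactly the locality cost mentioned above.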

David Gruzman
  • 7,900
  • 1
  • 28
  • 30
  • My application is such that I need parts of both files in one map task in order to process them. If I have only one file's contents in a map task, then it cannot be independently processed. – justin waugh Apr 23 '12 at 19:00
  • I was doing what you have suggested. Thanks. – justin waugh Apr 25 '12 at 03:29