
I have just started learning MapReduce and have some questions I would like answered. Here goes:

1) Case 1: FileInputFormat is the input format, and the input path is a directory containing multiple files to be processed. If I have n files, each smaller than the block size of the Hadoop cluster, how many splits are calculated for the MapReduce job?

2) I extend FileInputFormat in a class called MyFileInputFormat and override isSplitable to always return false. The input configuration is the same as above. Will I get n splits in this case?

3) If, say, one of the n files is slightly larger than the cluster's block size, will I get n+1 splits in the second case?

Thanks in advance for the help!

1 Answer


Let's start with the basics of FileInputFormat.

  1. FileInputFormat is abstract, so you cannot use it directly: "public abstract class FileInputFormat".

  2. Let's assume you use an InputFormat such as TextInputFormat ("class TextInputFormat extends FileInputFormat") and answer your questions from there.

  3. The split logic in FileInputFormat is applied per file in the input path, so you will get N splits for the MapReduce job (Case 1).

  4. For Case 2 you will still have N splits, since you have only told the input format not to split individual files; each whole file then becomes exactly one split.

  5. For Case 3 you will again have N splits, because the files are not being split. Remember that the split logic is applied to each file individually, not to all the files taken together, so a file larger than a block still yields a single split when isSplitable returns false.

  6. CombineFileInputFormat is what you use if you want to combine multiple input files into a single split.
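A rough Python simulation of the per-file split logic may make the three cases concrete. This is a hypothetical sketch, not Hadoop code: the function name and signature are invented, and real FileInputFormat.getSplits() also honors configurable min/max split sizes, which are ignored here. It does model the 1.1x "slop" factor Hadoop applies when deciding whether a file's tail is worth its own split.

```python
# Simplified simulation of Hadoop's per-file FileInputFormat.getSplits()
# logic (assumed behavior; min/max split-size settings are ignored).

def get_splits(file_sizes, block_size, splitable=True):
    """Return a list of (file_index, offset, length) splits,
    applying the split logic to each file independently."""
    SPLIT_SLOP = 1.1  # a remainder up to 1.1x the split size stays one split
    splits = []
    for i, size in enumerate(file_sizes):
        if size == 0:
            splits.append((i, 0, 0))          # empty file -> one empty split
            continue
        if not splitable:
            splits.append((i, 0, size))       # whole file as a single split
            continue
        remaining = size
        while remaining / block_size > SPLIT_SLOP:
            splits.append((i, size - remaining, block_size))
            remaining -= block_size
        if remaining > 0:
            splits.append((i, size - remaining, remaining))
    return splits

BLOCK = 128 * 1024 * 1024  # assume a 128 MB block size

# Case 1: n files, each smaller than the block size -> n splits
small = [10 * 1024 * 1024] * 5
print(len(get_splits(small, BLOCK)))                    # 5

# Case 2: same files, isSplitable() returns false -> still n splits
print(len(get_splits(small, BLOCK, splitable=False)))   # 5

# Case 3: one file slightly larger than a block, not splitable -> still n splits
mixed = small[:-1] + [130 * 1024 * 1024]
print(len(get_splits(mixed, BLOCK, splitable=False)))   # 5
```

Note that because the logic runs per file, only a single large, splitable file (say 300 MB against a 128 MB block) produces multiple splits; no file is ever combined with another.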

KrazyGautam