In Hadoop, I have a 3 GB sequence file that I want to process in parallel. To do that, I am going to create 8 map tasks and therefore 8 FileSplits.
The FileSplit class has constructors that require:
Path of the file
Start position
Length
For example, the first split can start at 0 with length 3GB/8, the next split at 3GB/8 with length 3GB/8, and so forth.
Now, SequenceFile.Reader has a constructor that takes the same arguments:
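To make the arithmetic concrete, here is a minimal sketch of how the start/length pairs for the 8 splits could be computed. It uses plain Java only and does not call any Hadoop API; the class name SplitPlanner, the Range holder, and the plan method are hypothetical names I made up for illustration. Note that these are raw byte ranges, so they do not necessarily line up with record boundaries inside the sequence file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Hadoop): divides a file of a given
// length into contiguous byte ranges of roughly equal size.
public class SplitPlanner {

    // Hypothetical holder for one split's byte range: [start, start + length).
    public static final class Range {
        public final long start;
        public final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // Returns numSplits ranges that together cover [0, fileLength).
    // The last range absorbs any remainder when the length does not divide evenly.
    public static List<Range> plan(long fileLength, int numSplits) {
        if (fileLength < 0 || numSplits <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and numSplits > 0");
        }
        List<Range> ranges = new ArrayList<>();
        long base = fileLength / numSplits;
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            long length = (i == numSplits - 1) ? fileLength - start : base;
            ranges.add(new Range(start, length));
            start += length;
        }
        return ranges;
    }

    public static void main(String[] args) {
        long threeGb = 3L * 1024 * 1024 * 1024; // assumes 1 GB = 2^30 bytes
        for (Range r : plan(threeGb, 8)) {
            System.out.println("start=" + r.start + " length=" + r.length);
        }
    }
}
```

Each (start, length) pair would then be passed, together with the file's Path, to the FileSplit constructor described above.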
Path of the file
Start position
Length
For the first split (from 0 with length 3GB/8), SequenceFile.Reader was able to read the data, because that portion contains the file header with the compression type and the key and value class information.
However, for the other splits, SequenceFile.Reader was not able to read the data. I think this is because those portions of the file do not contain the sequence file header (because the split does not start at 0), and so a NullPointerException is thrown when I try to use the reader.
So is there a way to create file splits from a sequence file?