How to create splits from a sequence file in Hadoop?

Question

In Hadoop, I have a 3 GB sequence file that I want to process in parallel. To do that, I am going to create 8 map tasks and hence 8 FileSplits.

The FileSplit class has constructors that require:

Path of the file
Start position
Length

For example, the first split can start at 0 with length 3GB/8, the next split can start at 3GB/8 with length 3GB/8, and so forth.

SequenceFile.Reader also has a constructor that takes the same parameters:

Path of the file
Start position
Length

For the first split (from 0 with length 3GB/8), SequenceFile.Reader was able to read it, because that portion contains the header of the file, the compression type, and information about the key and value classes.

However, for the other splits SequenceFile.Reader was not able to read the split. I think this is because that portion of the file doesn't contain the header of the sequence file (the split does not start at 0), and hence it throws a NullPointerException when I try to use the sequence file.

So is there a way to make file splits from the sequence file?

Do you need to know the block locations of the file? If so, you can use "hdfs fsck -files -blocks -locations" to fetch the locations and send them to the 8 different maps so they can process in parallel. — Deepan Ram, Apr 12 '17 at 13:37
@DeepanRam Blocks are different from splits. I want to make file splits from the sequence file to use in MapReduce programming. — Mosab Shaheen, Apr 12 '17 at 14:47
Now I get your requirement. You can use Hadoop's SequenceFileInputFormat and let it do all the computations on its own. The command that I shared will show the total splits of the file (just for information); for each of the splits, one map task will be spawned. This also helps us calculate how many maps will be spawned against this input file. — Deepan Ram, Apr 12 '17 at 17:28
@DeepanRam Thanks for the information, but I previously tried SequenceFileInputFormat and I think you cannot set the number of maps or splits using it. Anyway, I posted the answer below. Kindly accept it as an answer to help others see the solution. — Mosab Shaheen, Apr 13 '17 at 12:12

score 0 · Answer 1 · answered Apr 13 '17 at 12:12

The start and length parameters of SequenceFile.Reader are not meant for selecting a portion of a sequence file. They specify the real beginning and span of a whole sequence file. For example, if you have a container file that holds five sequence files and you want to read one of them, you pass the start and length of that sequence file inside the container. You can also read from the beginning of a sequence file up to a specific length. What you cannot do is set start to the middle of a sequence file: the reader would skip the header and fail with a "not a sequence file" error, so start must point to the beginning of the sequence file.

Therefore, the solution is to create your file split in your InputFormat as usual:

new FileSplit(path, start, span, hosts);
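
As a rough sketch of where the start and span values might come from, the following computes eight contiguous (start, length) ranges over a 3 GB file. The class and method names are hypothetical, and plain long pairs stand in for FileSplit so that the snippet runs without Hadoop on the classpath; in a real InputFormat each pair would be handed to the FileSplit constructor together with the path and hosts.

```java
class SplitRanges {
    // Hypothetical helper: returns {start, length} pairs covering [0, fileLength).
    static long[][] ranges(long fileLength, int numSplits) {
        long[][] out = new long[numSplits][2];
        long base = fileLength / numSplits;
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            // The last range absorbs the remainder so the whole file is covered.
            long length = (i == numSplits - 1) ? fileLength - start : base;
            out[i][0] = start;
            out[i][1] = length;
            start += length;
        }
        return out;
    }

    public static void main(String[] args) {
        long threeGb = 3L * 1024 * 1024 * 1024;
        for (long[] r : ranges(threeGb, 8)) {
            System.out.println("start=" + r[0] + " length=" + r[1]);
        }
    }
}
```

The ranges are cut at arbitrary byte offsets, not at record boundaries; lining them up with records is left to the RecordReader.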

And you create the sequence reader in your RecordReader as usual (no need to specify start or length):

reader = new SequenceFile.Reader(fs, path, conf); // As usual
start = split.getStart(); // split is the FileSplit handed to the RecordReader
reader.sync(start); // Move to the next sync marker past start

The key is "sync": rather than simply skipping "start" bytes, it seeks to the next sync marker past the position given by the "start" of the split, so the reader begins on a record boundary.

And for the nextKeyValue of the RecordReader:

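    // Stop only once the split end is passed AND a sync marker has been crossed;
    // otherwise records between the split end and the next sync marker are
    // skipped, because the next split starts reading at that marker.
    long pos = reader.getPosition();
    boolean hasNext = reader.next(key, value);
    return hasNext && !(pos >= (start + span) && reader.syncSeen());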
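
To illustrate how sync-aligned splits divide the records, here is a toy model with no Hadoop dependency. Everything in it is an assumption made for illustration: the class name, the record offsets, and the simplification that offsets are counted from the end of the file header. Each split starts at the first sync marker at or after its start offset, and stops only once the position is at or past its end and a sync marker has just been crossed.

```java
import java.util.ArrayList;
import java.util.List;

class SyncSplitDemo {
    // offsets[i] is where record i begins; afterSync[i] is true when a sync
    // marker sits directly before record i. Returns the record indices that
    // the split [start, end) would read.
    static List<Integer> recordsForSplit(long[] offsets, boolean[] afterSync,
                                         long start, long end) {
        // Mimic reader.sync(start): jump to the first sync marker at or after start.
        int i = 0;
        while (i < offsets.length && !(afterSync[i] && offsets[i] >= start)) {
            i++;
        }
        List<Integer> read = new ArrayList<>();
        for (; i < offsets.length; i++) {
            long pos = offsets[i];            // position before reading record i
            boolean syncSeen = afterSync[i];  // reading record i crosses a marker
            if (pos >= end && syncSeen) {
                break;                        // the next split begins at this marker
            }
            read.add(i);
        }
        return read;
    }

    public static void main(String[] args) {
        long[] offsets      = {0, 10, 20, 30, 40, 50, 60, 70};
        boolean[] afterSync = {true, false, false, true, false, false, true, false};
        long[][] splits = {{0, 25}, {25, 50}, {50, 80}};
        for (long[] s : splits) {
            System.out.println("[" + s[0] + ", " + s[1] + ") -> "
                + recordsForSplit(offsets, afterSync, s[0], s[1]));
        }
    }
}
```

With these made-up offsets, the three splits read records 0-2, 3-5, and 6-7, so every record is read exactly once even though the split boundaries (25 and 50) do not coincide with sync markers. Record 5 begins exactly at the end of the second split and is still read by it, because no sync marker precedes it.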
