Hadoop: InputFormat for Variable-Length files without delimiter

Question

I have to process variable-length files without delimiters in Hadoop. The format of these files is:

(LengthRecord1)(Record1)(LengthRecord2)(Record2)...(LengthRecordN)(RecordN)

There is no delimiter between the records (the whole file is a single line), and there is no delimiter between a LengthRecord and the Record itself (the parentheses were added in this text only for clarity).
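To make the layout concrete, here is a minimal sketch in plain Java (no Hadoop classes) of how such a file could be read sequentially. The encoding of the length field is not stated above, so this sketch assumes each length is a 4-byte big-endian integer giving the record size in bytes. The class and method names (`LengthPrefixedRecords`, `readAll`) are hypothetical.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

final class LengthPrefixedRecords {

    // Assumption: every length prefix is a 4-byte big-endian integer
    // holding the size of the following record in bytes.
    static List<byte[]> readAll(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        List<byte[]> records = new ArrayList<>();
        while (true) {
            // Read the first prefix byte separately so that a clean end of
            // file can be told apart from a truncated length prefix.
            int first = in.read();
            if (first < 0) {
                break; // no more records
            }
            byte[] rest = new byte[3];
            in.readFully(rest); // throws EOFException if the prefix is cut off
            int length = (first << 24)
                    | ((rest[0] & 0xFF) << 16)
                    | ((rest[1] & 0xFF) << 8)
                    | (rest[2] & 0xFF);
            if (length < 0) {
                throw new IOException("Negative record length: " + length);
            }
            byte[] record = new byte[length];
            in.readFully(record); // throws EOFException if the record is cut off
            records.add(record);
        }
        return records;
    }
}
```

The loop only works when it starts at a record boundary, since the position of record n+1 is known only after reading the length of record n. That is the core of the problem with splits that start at an arbitrary byte offset.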

I don't think I can use the default TextInputFormat or KeyValueTextInputFormat classes, because they rely on a linefeed or carriage return to signal the end of a line.

So I think I have to write a custom InputFormat to load these files, but I don't know exactly how to do this.

Do I have to override createRecordReader() so that it reads the length of record n and uses it to find the end of record n? If so, how can I handle the fact that a split can start or end in the middle of a record?

Thanks in advance.

Regards

According to your description, your input file seems to be non-splittable. You will need to write a custom record reader and force it to run on a single mapper only. Another option would be a custom pre-processor that first splits the input into files of, say, n records each. After that, you would be able to read the multiple files. - U880D, Aug 02 '18 at 12:34
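The pre-processing option from the comment above could be sketched roughly as follows, in plain Java without Hadoop classes. It assumes, as a guess about the format, that each length prefix is a 4-byte big-endian integer. The names `RecordChunker` and `chunk` are hypothetical, and the chunks are kept in memory only to keep the sketch short; a real pre-processor would more likely stream each chunk to its own file.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

final class RecordChunker {

    // Splits a length-prefixed stream into chunks of at most recordsPerChunk
    // complete records. Each chunk keeps the same (length)(record) layout,
    // so it always starts and ends on a record boundary.
    // Assumption: length prefixes are 4-byte big-endian integers.
    static List<byte[]> chunk(InputStream raw, int recordsPerChunk) throws IOException {
        if (recordsPerChunk <= 0) {
            throw new IllegalArgumentException("recordsPerChunk must be positive");
        }
        DataInputStream in = new DataInputStream(raw);
        List<byte[]> chunks = new ArrayList<>();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        int inCurrentChunk = 0;
        while (true) {
            int first = in.read();
            if (first < 0) {
                break; // clean end of input
            }
            byte[] rest = new byte[3];
            in.readFully(rest); // fails on a truncated length prefix
            int length = (first << 24)
                    | ((rest[0] & 0xFF) << 16)
                    | ((rest[1] & 0xFF) << 8)
                    | (rest[2] & 0xFF);
            if (length < 0) {
                throw new IOException("Negative record length: " + length);
            }
            byte[] record = new byte[length];
            in.readFully(record); // fails on a truncated record
            // Copy the prefix and the record unchanged into the current chunk.
            out.writeInt(length);
            out.write(record);
            if (++inCurrentChunk == recordsPerChunk) {
                chunks.add(buffer.toByteArray());
                buffer.reset();
                inCurrentChunk = 0;
            }
        }
        if (inCurrentChunk > 0) {
            chunks.add(buffer.toByteArray()); // last, possibly shorter chunk
        }
        return chunks;
    }
}
```

Because every chunk begins at a record boundary, each resulting file could be handed to its own mapper, as long as the InputFormat used to read them does not split a file any further.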
