I have to process (with Hadoop) files that contain variable-length records without any delimiter. The format of these files is:
(LengthRecord1)(Record1)(LengthRecord2)(Record2)...(LengthRecordN)(RecordN)
There is no delimiter between the records (the whole file is a single line). There is also no delimiter between a LengthRecord and the Record itself (the parentheses were added in this text only for clarity).
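To make the layout concrete, here is a minimal sketch of how such a file could be parsed when read sequentially, outside Hadoop. It assumes, purely for illustration, that each length is stored as a 4-byte big-endian integer; the real encoding of the length field may differ. The class name LengthPrefixedReader and the method readAll are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LengthPrefixedReader {

    // Reads all records from the stream.
    // ASSUMPTION: every record is preceded by a 4-byte big-endian length.
    public static List<byte[]> readAll(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<byte[]> records = new ArrayList<>();
        while (true) {
            int length;
            try {
                length = data.readInt();
            } catch (EOFException e) {
                // End of stream (a truncated length prefix also ends up here).
                break;
            }
            byte[] record = new byte[length];
            // readFully throws EOFException if the record body is cut short.
            data.readFully(record);
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Build a small in-memory file in the assumed format.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        for (String s : new String[] {"abc", "hello", ""}) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length);
            out.write(bytes);
        }
        // Parse it back and print each record on its own line.
        for (byte[] r : readAll(new ByteArrayInputStream(buffer.toByteArray()))) {
            System.out.println(new String(r, StandardCharsets.UTF_8));
        }
    }
}
```

The sketch shows why the format is awkward for Hadoop: a record boundary can only be found by walking the length prefixes from the start of the file, so a reader dropped at an arbitrary byte offset cannot tell where the next record begins.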
I think I can use neither the default TextInputFormat nor the default KeyValueTextInputFormat class, because both rely on a linefeed or carriage return to signal the end of a line.
So I think I have to write a custom InputFormat to load these files, but I don't know exactly how to do this.
Do I have to override createRecordReader() so that my reader reads the length of record n and uses it to find the end of record n? If so, how can I handle the fact that a split can begin or end in the middle of a record?
Thanks in advance.
Regards