I have a lot of files in HDFS and want to copy them into SequenceFiles with an MR job. The key type of the sequence file is Text (I use the SHA1 of the file), and the value type is BytesWritable (the file content). I found some example code that reads the whole file content into a byte array, say buffer, and then sets that buffer on the BytesWritable object. For example:
// read the whole file into memory, then append one key/value pair
byte[] buffer = new byte[(int) file.length()];
FileInputStream fis = new FileInputStream(fileEntry);
int length = fis.read(buffer);   // note: read() is not guaranteed to fill the buffer
fis.close();

key.set(sha1);
value.set(buffer, 0, buffer.length);
writer.append(key, value);
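For reference, this is a minimal sketch of how the writer and the key/value objects used above might be created (assuming the Hadoop 2.x SequenceFile.Writer.Option API); the output path and configuration are my own placeholders, not part of the original snippet:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
Path seqPath = new Path("/user/me/files.seq");   // hypothetical output path

SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(seqPath),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));

Text key = new Text();
BytesWritable value = new BytesWritable();
// ... append key/value pairs as in the snippet above ...
writer.close();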
My question is: if an input file is very big, the buffer may exceed the memory limit. Can I append to the BytesWritable object in a loop, writing a smaller amount of data in each iteration? Or can I just hand an input stream to the BytesWritable object and let it handle the problem?
Thanks.