
Hello, gurus!

For a long time I couldn't find an answer to the following question: how does Hadoop split a big file while writing it? Example: 1) block size 64 MB; 2) file size 128 MB (a flat file containing text).

When I write the file, it will be split into 2 parts (file size / block size). But could the following happen: Block 1 ends with ... word300 word301 wo and Block 2 starts with rd302 word303 ...? Or will it be the right case:

Block 1 ends with ... word300 word301 and Block 2 starts with word302 word303 ...?

Or could you link to a place that describes Hadoop's splitting algorithm?

Thank you in advance!

Mijatovic

2 Answers


The file is split arbitrarily on byte boundaries, so it may well be cut into something like wo and rd302.
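To make the arithmetic concrete, here is a minimal sketch (plain Java, not Hadoop's actual FileInputFormat code) of how split boundaries fall on byte offsets rather than word boundaries, using the 64 MB / 128 MB numbers from the question:

```java
public class SplitOffsets {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB block size, as in the question
        long fileSize  = 128L * 1024 * 1024;  // 128 MB file

        // Split boundaries are computed purely from byte offsets:
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            long length = Math.min(blockSize, fileSize - offset);
            System.out.printf("split at offset=%d, length=%d%n", offset, length);
        }
        // Prints two splits: [0, 64 MB) and [64 MB, 128 MB).
        // If "word302" happens to straddle byte 67108864, it is simply cut
        // into "wo" and "rd302" -- the writer never looks at the text.
    }
}
```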

This is not a problem you typically have to worry about; it is how the system is designed. The InputFormat and RecordReader parts of a MapReduce job deal with records that are split across block boundaries.

Donald Miner

Look at this wiki page: Hadoop's InputFormat will read the last line of a FileSplit past the split boundary and, when reading any FileSplit other than the first, it ignores the content up to the first newline.
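As a rough illustration of that rule, here is a simplified sketch (not Hadoop's real LineRecordReader, just the idea): a reader skips to the first newline on every split except the first, and finishes the line that crosses the end of its split.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class SplitLineReaderSketch {

    // Print the lines belonging to the split [start, start + length).
    static void readSplit(String path, long start, long length) throws IOException {
        long end = start + length;
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            if (start == 0) {
                in.seek(0);
            } else {
                // Not the first split: back up one byte and discard everything
                // up to the next newline. The previous split's reader owns the
                // line that crosses our start offset.
                in.seek(start - 1);
                in.readLine();
            }
            // Read whole lines as long as the line *starts* inside the split;
            // the last line may run past 'end' -- that is the "read past the
            // split boundary" part.
            while (in.getFilePointer() < end) {
                String line = in.readLine();
                if (line == null) break;  // end of file
                System.out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Illustrative only: read a 128 MB file as two 64 MB splits.
        long blockSize = 64L * 1024 * 1024;
        readSplit(args[0], 0, blockSize);
        readSplit(args[0], blockSize, blockSize);
    }
}
```

Together the two rules mean every line is read exactly once, even though neither split boundary respects line endings.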

Chun