
Hello, gurus!

For a long time I couldn't find an answer to the following question: how does Hadoop split a big file while writing it? Example: 1) block size 64 MB; 2) file size 128 MB (a flat file containing text).

When I write the file, it will be split into 2 parts (file size / block size). But could the following happen: Block 1 ends with ... word300 word301 wo and Block 2 starts with rd302 word303 ...? Or will it be the right case:

Block 1 ends with ... word300 word301 and Block 2 starts with word302 word303 ...?

Or could you link to a place that describes Hadoop's splitting algorithm?

Thank you in advance!

Mijatovic

2 Answers


The file is split arbitrarily on byte boundaries, so it may well be cut into something like wo and rd302.
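To make the arithmetic concrete, here is a minimal sketch (plain Java, not Hadoop's actual FileInputFormat code) of how split boundaries fall on byte offsets rather than word boundaries, using the 64 MB / 128 MB numbers from the question:

```java
public class SplitOffsets {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB block size, as in the question
        long fileSize  = 128L * 1024 * 1024;  // 128 MB file

        // Split boundaries are computed purely from byte offsets:
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            long length = Math.min(blockSize, fileSize - offset);
            System.out.printf("split at offset=%d, length=%d%n", offset, length);
        }
        // Prints two splits: [0, 64 MB) and [64 MB, 128 MB).
        // If "word302" happens to straddle byte 67108864, it is simply cut
        // into "wo" and "rd302" -- the writer never looks at the text.
    }
}
```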

This is not a problem you typically have to worry about; it is how the system is designed. The InputFormat and RecordReader parts of a MapReduce job deal with records that are split across block boundaries.

Donald Miner

Look at this wiki page: Hadoop's InputFormat will read the last line of a FileSplit past the split boundary and, when reading any FileSplit other than the first, it ignores the content up to the first newline.
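As a rough illustration of that rule, here is a simplified sketch (not Hadoop's real LineRecordReader, just the idea): a reader skips to the first newline on every split except the first, and finishes the line that crosses the end of its split.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class SplitLineReaderSketch {

    // Print the lines belonging to the split [start, start + length).
    static void readSplit(String path, long start, long length) throws IOException {
        long end = start + length;
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            if (start == 0) {
                in.seek(0);
            } else {
                // Not the first split: back up one byte and discard everything
                // up to the next newline. The previous split's reader owns the
                // line that crosses our start offset.
                in.seek(start - 1);
                in.readLine();
            }
            // Read whole lines as long as the line *starts* inside the split;
            // the last line may run past 'end' -- that is the "read past the
            // split boundary" part.
            while (in.getFilePointer() < end) {
                String line = in.readLine();
                if (line == null) break;  // end of file
                System.out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Illustrative only: read a 128 MB file as two 64 MB splits.
        long blockSize = 64L * 1024 * 1024;
        readSplit(args[0], 0, blockSize);
        readSplit(args[0], blockSize, blockSize);
    }
}
```

Together the two rules mean every line is read exactly once, even though neither split boundary respects line endings.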

Chun