Hadoop data split and data flow control

Question

I have two questions about Hadoop as a storage system.

1) I have a Hadoop cluster of 3 DataNodes, and I want to direct the splits of a huge file, say 128 MB (assuming a split size of 64 MB), to DataNodes of my choice. That is, how can I control which split goes to which DataNode? For example, with 3 DataNodes (D1, D2, D3), I want a particular split (say 'A') to be placed on a particular DataNode, say D2. How can we do this?

2) What is the smallest possible split size in a Hadoop filesystem, and how can we configure the filesystem to use it?

score 1 · Accepted Answer · answered Aug 14 '12 at 00:47

1) You can't control where the data blocks are placed; the NameNode decides which DataNodes receive each block.

2) As small as you want (it should probably be a multiple of 1024 bytes, though I don't think there is an actual constraint on this). On modern hardware, however, anything smaller than 64 / 128 MB is inefficient. If you are doing anything CPU-intensive in the MR job, you can specify a smaller processing split size without changing the block size.
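To make the difference between block size and processing split size concrete, here is a minimal plain-Java sketch. It uses no Hadoop classes; `SplitSizeSketch`, `computeSplitSize` and `countSplits` are hypothetical names for illustration. It assumes the clamping rule used by the newer MapReduce `FileInputFormat` (minimum and maximum split size applied around the block size), which I believe is driven by the `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize` settings; check the names for your version. The split count is a simplified ceiling division and ignores any merging of a small trailing remainder.

```java
// Plain-Java illustration only; no Hadoop classes are used here.
final class SplitSizeSketch {

    // Clamp the block size between the minimum and maximum split sizes
    // (assumed to mirror the rule FileInputFormat applies).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Simplified split count: ceiling of fileSize / splitSize.
    static long countSplits(long fileSize, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 128 * mb;   // the 128 MB file from the question
        long blockSize = 64 * mb;   // the 64 MB block size from the question

        // No explicit limits: the split size equals the block size.
        long defaultSplit = computeSplitSize(blockSize, 1, Long.MAX_VALUE);
        System.out.println(countSplits(fileSize, defaultSplit)); // prints 2

        // Capping the split size at 16 MB gives more, smaller processing splits.
        long smallSplit = computeSplitSize(blockSize, 1, 16 * mb);
        System.out.println(countSplits(fileSize, smallSplit));   // prints 8
    }
}
```

The stored blocks stay at 64 MB in both cases; only the number of map inputs changes.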

answered Aug 14 '12 at 00:47

Chris White

Thank you, Chris, for your answer. My question was about directing new incoming data to a particular DataNode by altering the source, writing an application, or setting some priority. – Ankur Saran Aug 14 '12 at 05:16
Can we make some changes to the source to control block flow? At the least, can we distribute blocks on the basis of their MD5 checksum, i.e. blocks with an MD5 sum of 1-100 go to NodeA, 100-200 go to NodeB, 200-300 go to NodeC, and so on? – Ankur Saran Nov 21 '12 at 07:35
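
The checksum-bucketing idea in the last comment can be illustrated outside Hadoop. The following is a minimal plain-Java sketch, not something wired into HDFS: `Md5NodeBucketSketch` and `pickNode` are hypothetical names, the node labels D1 to D3 are taken from the question, and it uses the digest modulo the node count rather than the fixed ranges the comment describes. Making HDFS itself place blocks this way would require changes on the NameNode side, which this sketch does not attempt.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; this is not an HDFS API.
final class Md5NodeBucketSketch {

    // Map a block's bytes to one node: MD5 digest modulo the number of nodes.
    static String pickNode(byte[] blockData, List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(blockData);
            // Interpret the 16-byte digest as a non-negative integer.
            BigInteger value = new BigInteger(1, digest);
            int index = value.mod(BigInteger.valueOf(nodes.size())).intValue();
            return nodes.get(index);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("D1", "D2", "D3");
        byte[] splitA = "contents of split A".getBytes(StandardCharsets.UTF_8);
        // The same bytes always map to the same node.
        System.out.println("split A -> " + pickNode(splitA, nodes));
    }
}
```

The mapping is deterministic for a fixed node list, but adding or removing a node changes the bucket of most blocks.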
