
I have two questions about using Hadoop as a storage system.

  1. I have a Hadoop cluster with 3 DataNodes, and I want to direct the splits of a large file, say 128 MB (assuming a split size of 64 MB), to DataNodes of my choice. That is, how can I control which split goes to which DataNode? For example, given three DataNodes (D1, D2, D3), I want a particular split (say 'A') to be placed on a particular DataNode, say D2.

    How can we do this?

  2. What is the smallest possible split size in the Hadoop filesystem, and how can it be configured to use that size?

j0k
Ankur Saran

1 Answer


1) You can't control where the data blocks are placed; HDFS decides which DataNodes receive each block.

2) As small as you want (it should probably be a multiple of 1024 bytes, though I don't think there is an actual constraint on this). On modern hardware, however, anything smaller than 64 or 128 MB is inefficient. (You can specify a smaller processing split size if you are doing anything CPU-intensive in the MR job.)
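
To make the size trade-off concrete, here is a minimal arithmetic sketch, not Hadoop code. It only counts how many blocks a file of a given size is cut into; the helper names `block_count` and `block_ranges` are made up for illustration. The point is that every extra block is one more entry the filesystem has to track, which is one reason very small sizes tend to be inefficient.

```python
MB = 1024 * 1024


def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed to hold file_size bytes (hypothetical helper)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Integer ceiling division: a partial final block still counts as a block.
    return (file_size + block_size - 1) // block_size


def block_ranges(file_size: int, block_size: int) -> list[tuple[int, int]]:
    """(offset, length) of each block; only the last one may be shorter."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


if __name__ == "__main__":
    # The 128 MB file from the question, cut at 64 MB and at 1 MB.
    print(block_count(128 * MB, 64 * MB))  # 2
    print(block_count(128 * MB, 1 * MB))   # 128
```

With the 64 MB size from the question, the 128 MB file becomes 2 blocks; at 1 MB it becomes 128, each of which would have to be tracked and placed separately.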

Chris White
  • Thank you, Chris, for your answer. My question was about directing new incoming data to a particular DataNode by altering the source, writing an application, or setting some priority. – Ankur Saran Aug 14 '12 at 05:16
  • Can we make some changes to the source to control block flow? At the least, can we distribute blocks on the basis of their MD5 checksum, i.e. blocks with an MD5 sum of 1-100 go to NodeA, 100-200 go to NodeB, 200-300 go to NodeC, and so on? – Ankur Saran Nov 21 '12 at 07:35