I have a single node (pseudo-distributed config) and I'm considering adding a second slave node. Does it matter if the slave has less HD capacity? Will the rebalance take care of that by itself? I'm far from a Hadoop expert.
2 Answers
1
No, it doesn't matter, but HDFS will not redistribute the blocks to the new node automatically, so you will have to do that yourself. The easiest way is to run `bin/start-balancer.sh`. Also, before you do any rebalancing, make sure you modify your conf files accordingly to accommodate the move from a pseudo-distributed configuration to a cluster one.
Check this question on the Hadoop FAQ for more ways to rebalance.
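For illustration, a rough sketch of what that could look like (assuming a 0.20/1.x-style conf layout; the hostname `slave2` and the `-threshold` value are placeholders, not taken from your setup):

```sh
# Rough sketch, assuming a 0.20/1.x-style layout; "slave2" and the threshold are placeholders.

# 1. List the new slave's hostname in conf/slaves so start-dfs.sh brings up its DataNode.
echo "slave2" >> conf/slaves

# 2. Once the new DataNode has joined, spread existing blocks across both nodes.
#    -threshold is the allowed deviation (in %) from average disk usage; 10 is the default.
bin/start-balancer.sh -threshold 10

# The balancer can be stopped at any time without risk to the data:
bin/stop-balancer.sh
```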
- OK, looks perfect to me. So I will use `start-balancer.sh` first. I thought Hadoop 'duplicates' content over nodes. – millebii May 07 '11 at 10:21
- Actually, Hadoop in a cluster configuration will maintain three replicas of each block; that is, by default the dfs.replication parameter in conf/hdfs-site.xml is set to 3. In your case you should set it to 2 and adjust it as you add more datanodes (see the config sketch after this comment thread). – May 09 '11 at 05:24
- In my case that would mean that the smaller configuration (the new node) will determine the size of the HDFS: this is what I was worried about, please confirm. Thx – millebii May 09 '11 at 06:10
- If the discrepancy between hard disk sizes is sizable and you are expecting to load HDFS up to capacity, then yes, you will be handicapped by using replication at all. In that case you should set it to 1 and optimize your block size for the type of files you are distributing over HDFS. You will be giving up a lot in robustness and fault tolerance, but if I understand your issue correctly, you are in a corner case and should be willing to take those hits regardless. – May 09 '11 at 21:00
- If the comments plus the answer solved your problem, would you mind marking this as solved? Otherwise, let me know where you are stuck and I'll help you solve it. – May 10 '11 at 18:13
- Sorry, I was on a burning launch operation the whole day. Sizable difference indeed (x3), and I'm currently using 60% of the big HD (my application is Nutch). So yes, I think I'll start with replication at 1 to add HD & CPU capacity. I may enable replication later when I'm over 3 nodes, but not to start with. I don't understand your block size remark. – millebii May 10 '11 at 21:16
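To tie the comments above together, a rough, illustrative sketch of the dfs.replication and block-size settings being discussed (the values, the heredoc overwrite, and the `setrep` call are assumptions to adapt, not a drop-in config):

```sh
# Illustrative sketch only: merge these properties into your existing
# conf/hdfs-site.xml rather than overwriting it (the overwrite here is for brevity).
cat > conf/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- 1 while the small disk caps capacity; raise to 2-3 as nodes are added -->
    <value>1</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <!-- block size in bytes (128 MB here); tune it to the size of the files you store -->
    <value>134217728</value>
  </property>
</configuration>
EOF

# dfs.replication only applies to files written after the change; existing files
# keep their old replication factor unless you change it explicitly, e.g.:
bin/hadoop fs -setrep -R 1 /
```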
0
Hadoop will balance the load. In addition, you can set the "dfs.replication" property to the number of replicas you want.