I have installed Hadoop in standalone mode locally on a Windows machine, with one datanode and the replication factor set to 1. I have already uploaded some data to this datanode; let us call it datanode1.

I would like to add one or two more datanodes to Hadoop, change the replication factor to two or three, and have the existing data replicated two or three times.

For example, I would like to add just one additional datanode, datanode2, and replicate all the existing data from datanode1 onto it. Any newly uploaded data would then be saved on both datanode1 and datanode2, since the replication factor is changed to two.

I have tried changing the hdfs-site.xml file to reflect the above changes (adding datanode2 and changing the replication factor to two) and then running start-all.cmd, but the existing data in datanode1 does not seem to be replicated, and Hadoop still has only one datanode.
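
For reference, here is a minimal sketch of the replication part of that change, written as a small Python script that renders the relevant hdfs-site.xml content. `dfs.replication` is the standard HDFS property for the default replication factor; the helper name `render_hdfs_site` is hypothetical and only for illustration. As far as I know, hdfs-site.xml has no property that lists datanodes: each DataNode is a separate process that registers with the NameNode, so this setting alone would not make a second datanode appear. It also only sets the default for files written afterwards.

```python
import xml.etree.ElementTree as ET


def render_hdfs_site(properties):
    """Render a Hadoop-style <configuration> document as a string.

    `properties` maps property names to values, e.g. {"dfs.replication": 2}.
    (Hypothetical helper, not part of Hadoop.)
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# dfs.replication is the default replication factor for newly written files;
# files already stored in HDFS keep the factor they were written with.
print(render_hdfs_site({"dfs.replication": 2}))
```

The printed XML is what would go into hdfs-site.xml, alongside whatever other properties the installation already defines.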

Any idea on how to set it up?

XYZ
  • I think you need to run a `setrep` hadoop fs command for existing files – OneCricketeer Jan 13 '21 at 16:58
  • @OneCricketeer, running `setrep` will not replicate the existing data on the datanode. Also, I have tried running `setrep`, changing the replication factor to two, and adding one more datanode in the hdfs-site.xml file, but there is still only one datanode. The additional datanode is not recognized by the standalone HDFS system. I am not sure whether standalone HDFS supports multiple datanodes. – XYZ Jan 19 '21 at 12:27
  • What do you mean by "standalone hdfs"? Pseudo-distributed mode can run multiple datanode processes; however, using Hadoop on Windows and NTFS-formatted drives is, in general, discouraged. – OneCricketeer Jan 19 '21 at 16:02
  • @OneCricketeer, I am following the terminology "Local (Standalone) Mode" from the official website https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation . I do not intend to use Hadoop on Windows for production purposes. My goal is to set up a test environment so that I can play around and get familiar with it. Currently I do not want to work on the Linux (Fully-Distributed Mode) production environment because I am a little afraid that I may break the existing HDFS system on the production servers. – XYZ Jan 19 '21 at 16:25
  • Updated based on what I have read in recent months: 1. Do both testing and production on a Linux machine. If a Linux machine is not available, at least try to use virtual machines running Linux. 2. It is not possible to have multiple datanodes in standalone mode. If there is only one machine, multiple nodes can be achieved with virtual machines. Only in multi-node mode can additional datanodes be added or existing ones excluded. – XYZ May 18 '21 at 12:08
  • You can use Docker instead of full VMs, too – OneCricketeer May 18 '21 at 12:28
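
To make the `setrep` suggestion from the comments concrete, here is a rough sketch in Python that builds the two HDFS CLI calls involved: `hdfs dfsadmin -report` to check how many datanodes are live, and `hdfs dfs -setrep -w <n> <path>` to raise the replication factor of files that already exist. The helper names (`setrep_command`, `report_command`, `run`) and the `dry_run` switch are hypothetical. The sketch assumes the `hdfs` launcher is on the PATH; on Windows the launcher is `hdfs.cmd`, so its full path may need to be passed as `hdfs_bin`. HDFS stores at most one replica of a block per datanode, so with a single datanode the extra replica cannot be placed and `-w` would likely keep waiting.

```python
import subprocess


def setrep_command(replicas, path="/", wait=True, hdfs_bin="hdfs"):
    """Build the argument list for `hdfs dfs -setrep`.

    -w asks the command to wait until the target replication is reached.
    """
    cmd = [hdfs_bin, "dfs", "-setrep"]
    if wait:
        cmd.append("-w")
    cmd += [str(replicas), path]
    return cmd


def report_command(hdfs_bin="hdfs"):
    """Build the argument list for `hdfs dfsadmin -report` (lists datanodes)."""
    return [hdfs_bin, "dfsadmin", "-report"]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: this only prints the commands it would execute.
    run(report_command())
    run(setrep_command(2, "/"))
```

Checking the report first is deliberate: if it still shows one live datanode, changing the replication factor will only leave the blocks under-replicated.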

0 Answers