For background, we have two clusters, currently used as production and development. We copy files (using hadoop distcp -update) from the production cluster to the development cluster after the live processes have produced them, so the development cluster effectively also works as a DR cluster.
Hadoop version is the same on both clusters: Hadoop 2.6.0-cdh5.12.1
However, the development cluster only has about 65% of the storage capacity of the live cluster. To deal with that, we have a default replication factor of 3 for live and 2 for development.
I've noticed that the files copied from live to development end up with a replication factor of 3 on the development cluster, rather than that cluster's default of 2. From what I've read, I think this is the expected behaviour, even if it's not how I'd like it to behave.
I have two questions off the back of this:
- From some research, it has been suggested that -setrep could be used after the copy, or that -D dfs.replication=x could be used as part of the copy command. Has anyone had any experience with either of these options?
- Has anyone had to deal with this situation before and found a different solution?
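For reference, this is roughly how I understand the two options from the first question would look. It is only a sketch: the NameNode hosts (prod-nn, dev-nn), the port and the /data/output path are made-up placeholders, the run helper is just there so the commands can be printed before being executed, and I have not confirmed that either approach behaves as intended on CDH 5.12.1.

```shell
#!/usr/bin/env bash
# Placeholder endpoints and path (hypothetical); replace with real values.
SRC="hdfs://prod-nn:8020/data/output"
DST="hdfs://dev-nn:8020/data/output"
TARGET_REP=2

# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 in the environment to actually run the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Option 1: override the client-side replication for the copy itself.
# The generic -D option has to come before the distcp options.
copy_with_replication() {
  run hadoop distcp -Ddfs.replication="$TARGET_REP" -update "$SRC" "$DST"
}

# Option 2: copy as we do today, then lower the replication on the target.
# Given a directory, -setrep applies to every file underneath it.
fix_replication_after_copy() {
  run hdfs dfs -setrep "$TARGET_REP" "$DST"
}

copy_with_replication
fix_replication_after_copy
```

In both cases I assume the copy must not use -p with the r flag, since that asks distcp to preserve the source replication factor.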
Thanks for your help.