hadoop distcp between clusters with different replication factors

Question

As some background, we have two clusters, currently used as production and development. We copy files (using hadoop distcp -update) from the production cluster to the development cluster after the live processes have produced them, so the development cluster effectively also works as a DR cluster.

Hadoop version is the same on both clusters: Hadoop 2.6.0-cdh5.12.1

However, the development cluster only has about 65% of the storage capacity of the live cluster. To deal with that, we have a default replication factor of 3 for live and 2 for development.

I've noticed that the files that are being copied from live to development have a replication factor of 3. I've done some reading and think this is how it should be behaving, even if it's not how I'd like it to behave.

I have two questions off the back of this:

From some research, it has been suggested that -setrep could be used after the copy, or that -D dfs.replication=x could be used as part of the copy command. Has anyone had any experience with either of these options?
Has anyone had to deal with this situation before and found a different solution?

Thanks for your help.

score 4 · Accepted Answer · answered Dec 19 '17 at 08:56

I did some testing and made the following changes:

Changed the distcp command from hadoop distcp -update $SOURCE $TARGET to hadoop distcp -D dfs.replication=2 -update $SOURCE $TARGET
Ran hdfs dfs -setrep -w 2 $TARGET over the files that had been copied previously, and therefore had a replication factor of 3, to lower their replication factor.
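
For anyone wanting to script this, the two steps might be wrapped up roughly as follows. This is only a sketch: the SOURCE and TARGET defaults are made-up placeholder paths, and REPL, DRY_RUN and the function names are my own illustrative choices, not anything Hadoop defines. It defaults to a dry run that just prints the commands, so nothing touches either cluster until DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch only. The default paths below are hypothetical placeholders.
SOURCE="${SOURCE:-hdfs://prod-nn:8020/data}"
TARGET="${TARGET:-hdfs://dev-nn:8020/data}"
REPL="${REPL:-2}"        # replication factor wanted on the development cluster
DRY_RUN="${DRY_RUN:-1}"  # 1 = print the commands, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: copy new or changed files, overriding dfs.replication for this job.
copy_with_replication() {
  run hadoop distcp -D "dfs.replication=${REPL}" -update "$SOURCE" "$TARGET"
}

# Step 2: lower the replication factor of files copied before the change.
# -w waits until the new replication factor has been reached.
fix_existing_replication() {
  run hdfs dfs -setrep -w "$REPL" "$TARGET"
}

copy_with_replication
fix_existing_replication
```

Step 2 should only be needed once, for the files copied before the distcp command was changed. It can take a while on a large tree because of -w.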

Disk space has started to fall, so I'm counting this as a success. Maybe one day I'll be able to claim I know what I'm doing.
