We have many distcp jobs copying data from our primary cluster to our backup cluster. These jobs run all day and copy almost all tables of our critical databases. We use webhdfs for the transfers.
Some of these jobs run for hours, particularly for huge tables stored in ORC format. Is there any way we can optimise the distcp operation between the two clusters? Any suggestions are welcome.
We tried raising the -bandwidth option (the per-map throughput cap, in MB/s) to speed things up. Below is an excerpt from our script.
PROP="-Dmapreduce.task.timeout=300000 -Dmapred.job.queue.name=$YARN_QUEUE -Dmapred.job.name="cpy-${jobName}" -bandwidth 800 "
hadoop distcp ${PROP} $1 WEBHDFS://$DESTNAMENODE$2 >> $3 2>&1
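For reference, here is a sketch of the kind of tuning we have been looking at, using standard distcp options: more map-level parallelism plus incremental copies. The flag values are illustrative (not benchmarked on our clusters), and the sketch assumes the same PROP setup as above.

# Sketch of a tuned invocation; flag values are illustrative.
# -m                    : raises the cap on simultaneous map tasks (distcp defaults to 20)
# -strategy dynamic     : maps pull file chunks from a shared queue, so a few
#                         huge ORC files do not leave most of the maps idle
# -numListstatusThreads : parallelises building the source file listing (max 40)
# -update               : copies only files that are missing or changed at the target
hadoop distcp ${PROP} \
    -m 100 \
    -strategy dynamic \
    -numListstatusThreads 40 \
    -update \
    "$1" "webhdfs://${DESTNAMENODE}$2" >> "$3" 2>&1

One caveat with -update: it changes the path semantics (the contents of the source directory land directly under the target rather than inside a new subdirectory), so it would need checking against how our table paths are laid out.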