
I need to copy a directory from one cluster to another cluster with a similar HDFS setup (both are MapR clusters).

I am planning to use the DistCp Java API, but I want to avoid duplicate copies of files in the directory. I also want to know whether these operations are fault tolerant: if a file is not copied completely due to a loss of connection, will DistCp restart the copy so that the file is transferred properly?


2 Answers


DistCp uses MapReduce to effect its distribution, error handling and recovery, and reporting.

Please see the Update and Overwrite section of the DistCp guide.

You can use the -overwrite option to avoid duplicates, and it is worth looking at the -update option as well. If the network connection fails, you can re-initiate the copy with -overwrite once the connection is recovered.

See the examples of -update and -overwrite in the guide linked above.
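
Since you plan to use the DistCp Java API, here is a minimal sketch of how those flags map onto DistCpOptions. This assumes the Hadoop 2.x API; the maprfs:// paths are placeholders for your own clusters:

    import java.util.Collections;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public class DistCpUpdateExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Placeholder paths -- substitute your own cluster URIs.
            Path source = new Path("maprfs://source-cluster/user/data");
            Path target = new Path("maprfs://target-cluster/user/data");

            DistCpOptions options = new DistCpOptions(
                    Collections.singletonList(source), target);
            // setSyncFolder(true) corresponds to -update: only files that are
            // missing or differ on the target are copied, so re-running after
            // a failure does not duplicate files that already arrived intact.
            options.setSyncFolder(true);
            // Alternatively, options.setOverwrite(true) corresponds to -overwrite.

            DistCp distCp = new DistCp(conf, options);
            Job job = distCp.execute();  // submits and waits for the MapReduce job
            System.out.println("DistCp successful: " + job.isSuccessful());
        }
    }

Note that setSyncFolder(true) (-update) and setOverwrite(true) (-overwrite) are mutually exclusive in DistCpOptions; pick the one that matches how you want existing target files handled.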

Ram Ghadiyaram

Here is the link for the refactored DistCp: https://hadoop.apache.org/docs/r2.7.2/hadoop-distcp/DistCp.html

As "@RamPrasad G" mentioned, I guess you have no option other than redo the distcp in case of network failure.

Some good reads:

Hadoop distcp network failures with WebHDFS

http://www.ghostar.org/2015/08/hadoop-distcp-network-failures-with-webhdfs/

Distcp between two HA Cluster

http://henning.kropponline.de/2015/03/15/distcp-two-ha-cluster/

Transferring Data to/from Altiscale via S3 using DistCp

https://documentation.altiscale.com/transferring-data-using-distcp
This page has a link to a shell script with retries, which could be helpful to you; a minimal Java sketch in the same spirit follows the note below.

Note: Thanks to the original authors.
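
For completeness, here is a minimal retry loop in Java, in the same spirit as the retry script linked above. It is only a sketch assuming the Hadoop 2.x DistCpOptions API; the paths, attempt count, and back-off interval are placeholders:

    import java.util.Collections;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public class DistCpWithRetry {
        private static final int MAX_ATTEMPTS = 3;  // placeholder retry budget

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder paths -- substitute your own cluster URIs.
            Path source = new Path("maprfs://source-cluster/user/data");
            Path target = new Path("maprfs://target-cluster/user/data");

            for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                try {
                    DistCpOptions options = new DistCpOptions(
                            Collections.singletonList(source), target);
                    options.setSyncFolder(true);  // -update: skip files already copied
                    if (new DistCp(conf, options).execute().isSuccessful()) {
                        System.out.println("Copy completed on attempt " + attempt);
                        return;
                    }
                } catch (Exception e) {
                    System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
                }
                if (attempt < MAX_ATTEMPTS) {
                    Thread.sleep(30_000L);  // simple fixed back-off before retrying
                }
            }
            throw new RuntimeException("DistCp failed after " + MAX_ATTEMPTS + " attempts");
        }
    }

Because -update skips files that already match on the target, each retry should only transfer whatever the previous attempt left missing or incomplete.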

Marco99