DistCp runs as a MapReduce job, whether it copies between clusters or within one. My assumption was that it copies files at the input-split level, which would help copy performance because a single file would be copied by multiple mappers working on different "pieces" in parallel. However, going through the Hadoop DistCp documentation, it seems DistCp only works at the file level. Please refer to: hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
According to the DistCp doc, DistCp only splits the list of files, not the files themselves, and hands the partitions of that list to the mappers.
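To make my reading of the doc concrete, here is a minimal Python sketch of what I think "splitting the list of files" means. It is only my assumption of the behaviour, not DistCp's actual code: the function name `partition_file_list`, the sample paths, and the "roughly equal bytes per mapper" rule are all made up for illustration.

```python
def partition_file_list(files, num_maps):
    """Group (path, size_in_bytes) pairs into at most num_maps contiguous
    partitions of roughly equal total size. Files are never split.

    Hypothetical illustration only; not taken from the DistCp source.
    """
    if num_maps < 1:
        raise ValueError("num_maps must be at least 1")

    total_bytes = sum(size for _, size in files)
    # Ceiling division: the byte budget each mapper should roughly receive.
    target = -(-total_bytes // num_maps)

    groups = [[]]
    bytes_in_group = 0
    for path, size in files:
        # Start a new partition once the current one has reached its budget,
        # as long as there is still a mapper left to give it to.
        if bytes_in_group >= target and len(groups) < num_maps:
            groups.append([])
            bytes_in_group = 0
        groups[-1].append(path)  # the whole file goes to exactly one mapper
        bytes_in_group += size
    return groups


listing = [("/data/a", 100), ("/data/b", 100), ("/data/c", 300), ("/data/d", 100)]
print(partition_file_list(listing, 3))
# [['/data/a', '/data/b'], ['/data/c'], ['/data/d']]
```

In this sketch the large file `/data/c` ends up with a single mapper no matter how big it is, which is exactly the behaviour I am asking about.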
Can anyone explain how exactly this works?
- Additional question: if a file is assigned to only one mapper, how does that mapper find all of the file's input splits from the one node it is running on?