DistCp copies between or within clusters run as MapReduce jobs. My assumption was that it copies files at the input-split level, which would help copy performance because a single file would be copied by multiple mappers working on multiple "pieces" in parallel. However, when I went through the Hadoop DistCp documentation, it seems DistCp only works at the file level. Please refer to: hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html

According to the DistCp doc, DistCp only splits the list of files, not the files themselves, and gives the partitions of that list to the mappers.

Can anyone explain exactly how this works?

  • Additional question: if a file is assigned to only one mapper, how does that mapper find all the input splits of the file from the one node it is running on?
Tianqi Tong

1 Answer

For a single file of ~50 GB, only one map task will be launched to copy the data, since a file is the finest level of granularity in DistCp.

Quoting from the documentation:

Why does DistCp not run faster when more maps are specified?

At present, the smallest unit of work for DistCp is a file. i.e., a file is processed by only one map. Increasing the number of maps to a value exceeding the number of files would yield no performance benefit. The number of maps launched would equal the number of files.
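To make the quoted behaviour concrete, here is a minimal toy model in Python. It is not DistCp's code: the function name `assign_files_to_maps` and the greedy byte-balancing rule are assumptions made for illustration. The only property it is meant to show is the one from the documentation: whole files are handed to maps, so a single file can never occupy more than one map.

```python
def assign_files_to_maps(file_sizes, num_maps):
    """Toy model (hypothetical, not DistCp's implementation).

    Whole files are assigned to maps; a file is never split.
    Returns (maps, loads): the file names per map and the bytes per map.
    """
    # Requesting more maps than files gains nothing: the extra maps
    # would have no file to copy.
    effective = min(num_maps, len(file_sizes))
    maps = [[] for _ in range(effective)]
    loads = [0] * effective

    # Assumed balancing rule: give the next-largest file to the map
    # that currently has the fewest bytes to copy.
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        target = loads.index(min(loads))
        maps[target].append(name)
        loads[target] += size
    return maps, loads


if __name__ == "__main__":
    one_big_file = {"big.dat": 50 * 1024**3}  # a single ~50 GB file
    maps, loads = assign_files_to_maps(one_big_file, num_maps=20)
    print(len(maps), maps)
```

In this model, asking for 20 maps on a single file still yields one map doing all the work, which matches the FAQ entry quoted above.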

UPDATE
The block locations of the file are obtained from the NameNode during the MapReduce job. In DistCp, each mapper is started, if possible, on the node where the first block of the file is present. When the file consists of multiple blocks, any block that is not available on that node is fetched from a nearby node.
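The read pattern described in the update can be sketched with another small Python toy. Everything here is hypothetical: `plan_reads`, the node names, and the rule "otherwise take the first listed replica" are stand-ins for illustration, not HDFS client code. It only shows that a single mapper walks the file block by block, reading locally when a replica is on its own node and remotely otherwise.

```python
def plan_reads(block_locations, mapper_node):
    """Toy model (hypothetical) of one mapper reading a multi-block file.

    block_locations: one list of replica-holding nodes per block, in file
    order, standing in for what the NameNode reports.
    Returns a list of (block_index, source_node, "local" | "remote").
    """
    plan = []
    for index, replicas in enumerate(block_locations):
        if mapper_node in replicas:
            source = mapper_node   # a replica is on the mapper's own node
        else:
            source = replicas[0]   # assumed: first listed replica is the nearest
        kind = "local" if source == mapper_node else "remote"
        plan.append((index, source, kind))
    return plan


if __name__ == "__main__":
    # Hypothetical three-block file with two replicas per block.
    locations = [["n1", "n2"], ["n2", "n3"], ["n1", "n3"]]
    for step in plan_reads(locations, mapper_node="n1"):
        print(step)
```

The point of the sketch is that the mapper never needs all blocks on its own node; it just needs the block locations and network access to the other DataNodes.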

franklinsijo
  • Thanks! Are you saying the splits of the single file will be copied by multiple mappers in parallel? According to the DistCp doc, DistCp only splits the list of files, not the files themselves, and gives the partitions of the list to the mappers. I don't really understand this part. Please refer to: https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html – Tianqi Tong Feb 20 '17 at 23:00
  • @TianqiTong, you are right. Until now I was under the misconception that DistCp uses the typical MapReduce process. I have corrected my answer. – franklinsijo Feb 21 '17 at 08:36
  • Thanks for the correction! I've updated my question a bit so others will start to like it (hopefully). Another thing I don't understand is how that mapper finds all the pieces of the single file on a single node. (I've already put that in the description.) – Tianqi Tong Feb 24 '17 at 21:03
  • It is not necessary, and also highly unlikely, for all the blocks to be on a single node; the mapper will fetch blocks from other nodes whenever necessary. The block locations of the file are obtained from the NameNode. The mapper will be started on the node where the first block of the file is present. The remaining blocks are read from that node if available there; otherwise they are fetched from a nearby node. – franklinsijo Feb 25 '17 at 02:39
  • So does that mean DistCp will not give any performance improvement over cp when there is only a single file? – Farsan Rashid Aug 25 '18 at 10:02