I have noticed that s3-dist-cp sometimes takes much longer than usual because of a "slow node" issue. For Spark I have enabled speculative execution, which works fine. However, for s3-dist-cp I would like to understand the possible impact first.

For regular DistCp, the documentation (link: https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html#MapReduce_and_other_side-effects) says:

If mapreduce.map.speculative is set set-final and true, the result of the copy is undefined.

I'm aware that s3-dist-cp is a completely separate job, but I wonder whether there are any caveats. I wasn't able to find any related documentation.

Thanks for any suggestions!

Grzes
  • So what is the question then in fact? – thebluephantom Jul 07 '21 at 17:43
  • Are you running many jobs at the same time with s3DistCP? – thebluephantom Jul 07 '21 at 18:09
  • I would look at the Hadoop source rather than ask anyone on Stack Overflow, whose opinions will only be secondhand (other users) or, if they wrote bits of distcp, probably out of date. Note that s3-distcp isn't open source, so look at `distcp -direct -numListstatusThreads 40` for cloud perf – stevel Jul 07 '21 at 18:16
  • Never mind, it seems that `mapreduce.reduce.speculative` is programmatically set to `false`. Even if you specify `-Dmapreduce.reduce.speculative=true`, it is ignored. – Grzes Jul 08 '21 at 08:19
  • @stevel never underestimate others – thebluephantom Jul 08 '21 at 14:37
  • @thebluephantom maybe, but despite being someone who maintains distcp, even I wouldn't make any assertions without reviewing the code to see what the latest behaviour is. And Stack Overflow answers get out of date so fast... – stevel Jul 08 '21 at 14:46
  • @stevel point taken – thebluephantom Jul 10 '21 at 06:28

0 Answers