
I need to distcp only a certain number of files, and I couldn't find a way to do it directly.

  1. One idea is to copy the files into a temporary directory, distcp that directory, and delete the temp directory once the copy completes.

  2. Run individual distcp commands, one per file. This could be painful.

I'm not sure whether a comma-separated list of sources is allowed.

Any ideas?

Thanks in advance.

  • If they have a pattern, you can make use of the wildcards. Please show us sample of the directory structure. – franklinsijo May 03 '17 at 04:36
  • They're just application directories. Imagine Spark application history files: /var/log/spark/appHistory//. I just need a handful at a time, so wildcards are not super helpful. – Neelesh Salian May 03 '17 at 04:53

1 Answer


You can either pass all the files as sources to the DistCp command:

hadoop distcp hdfs://src_nn/var/log/spark/appHistory/<appId_1>/ \
              hdfs://src_nn/var/log/spark/appHistory/<appId_2>/ \
              ....
              hdfs://src_nn/var/log/spark/appHistory/<appId_n>/ \
              hdfs://dest_nn/target/

Or, create a file containing the list of sources and pass it to the command with the -f option:

hadoop distcp -f hdfs://src_nn/list_of_files hdfs://dest_nn/target/
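A minimal sketch of the -f approach, assuming the directory layout from the question; the application IDs (application_0001, etc.) and the src_nn/dest_nn namenode addresses are placeholders, not real values:

```shell
# Build a local source list for DistCp's -f option: one fully
# qualified source URI per line (three example app IDs shown).
printf 'hdfs://src_nn/var/log/spark/appHistory/%s/\n' \
    application_0001 application_0002 application_0003 > list_of_files

# Upload the list to HDFS and run the copy. These require a live
# cluster, so they are shown commented out for illustration only:
# hdfs dfs -put -f list_of_files hdfs://src_nn/list_of_files
# hadoop distcp -f hdfs://src_nn/list_of_files hdfs://dest_nn/target/
```

Each line in the list file is treated as an independent source path, so you can regenerate the file with a different handful of app IDs on each run.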
franklinsijo
    Forgot to reply back. But this saved me a lot of trouble. Been using this trick for 5 months in a system. Works. Thanks @franklinsijo – Neelesh Salian Nov 16 '17 at 19:42
  • do we need to do ```kinit``` for both clusters so ```klist``` shows two tickets for both clusters? – s.r Aug 20 '18 at 15:59