Questions tagged [distcp]

A Hadoop tool used for large inter- and intra-cluster copying.

The distcp command is a tool used for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which copies a partition of the files specified in the source list.
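As a baseline illustration of the tool described above, here is a minimal sketch of a DistCp invocation; the namenode hostnames and paths are placeholders, not taken from any question below. The command string is printed rather than executed so it can be inspected first.

```shell
# Hypothetical cluster endpoints; adjust to your environment.
SRC="hdfs://nn1:8020/data/events"
DST="hdfs://nn2:8020/data/events"

# -m caps the number of map tasks; -update copies only files that are
# missing or differ at the destination instead of overwriting everything.
CMD="hadoop distcp -m 20 -update $SRC $DST"
echo "$CMD"   # printed here; run it on an edge node with access to both clusters
```

The job this command launches is an ordinary MapReduce job, so its map tasks run on whichever cluster the command is submitted to.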

181 questions
1
vote
2 answers

When I do distcp, will the mappers run on the source or the destination?

I am running DistCp in Hadoop to load data from a dev cluster to a production cluster. My question is: where are the resources taken from, the source or the destination?
user8587005
  • 65
  • 1
  • 2
  • 8
1
vote
0 answers

Using distcp to copy from on-prem to Azure Blob

I'm trying to copy partitioned parquet files (created by Sqoop) to Azure Blob using the distcp utility, but with no luck. I'm running my code on Hadoop Hortonworks 2.7.3.2.6.4.0-91. I can create folders and files using: hadoop fs -D…
johovic
  • 11
  • 2
1
vote
1 answer

Using GroupBy while copying from HDFS to S3 to merge files within a folder

I have the following folders in HDFS…
Amistad
  • 7,100
  • 13
  • 48
  • 75
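For the merge-while-copying scenario in the question above, EMR's s3-dist-cp supports a --groupBy regex that concatenates matching files into one output per capture group. A hedged sketch follows; the bucket, paths, and regex are placeholders I chose for illustration, and the command is printed rather than executed.

```shell
# Hypothetical paths; s3-dist-cp is available on EMR cluster nodes.
SRC="hdfs:///logs/2019/01/01"
DST="s3://my-bucket/merged/2019/01/01"

# Files whose names match the pattern are merged; the capture group
# determines the name of the merged output file. --targetSize (MB)
# bounds the size of each merged file.
CMD="s3-dist-cp --src $SRC --dest $DST --groupBy '.*/(part).*' --targetSize 1024"
echo "$CMD"
```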
1
vote
1 answer

hadoop distcp issue while copying single file

(Note: I need to use distcp to get parallelism.) I have 2 files in the /user/bhavesh folder and 1 file in the /user/bhavesh1 folder. Copying 2 files from /user/bhavesh to the /user/uday folder works fine; this creates the /user/uday folder. Copying 1 file…
Bhavesh
  • 909
  • 2
  • 23
  • 38
1
vote
1 answer

s3DistCp order of concatenation of files

I am trying to use the S3DistCp tool on AWS EMR to merge multiple files (1.txt, 2.txt, 3.txt) to a single gzip file. I am using the groupBy flag. For now the output seems like the concatenation of source files in the reverse order by name. So the…
namitha gs
  • 13
  • 3
1
vote
2 answers

Efficient copy method in Hadoop

Is there a faster or more efficient way of copying files across HDFS other than distcp? I tried both the regular hadoop fs -cp as well as distcp, and both seem to give the same transfer rate, around 50 MBPS. I have 5TB of data split into smaller…
Vinay
  • 1,473
  • 4
  • 14
  • 24
1
vote
1 answer

S3 server-side encryption using Oozie workflow

I have a Sqoop job which writes data into an S3 bucket. If I run this job from the command line, it encrypts the S3 files. But if I use the same jar file to run the Sqoop job via an Oozie workflow, it pushes the data to S3 but encryption is…
Vijay
  • 924
  • 1
  • 12
  • 27
1
vote
1 answer

Transfer of files from unsecured hdfs to secured hdfs cluster

I wanted to transfer files from an unsecured HDFS cluster to a Kerberized cluster. I am using distcp to transfer the files. I have used the following command: hadoop distcp -D ipc.client.fallback-to-simple-auth-allowed=true hdfs://:8020/
Midhun Mathew Sunny
  • 1,271
  • 4
  • 17
  • 30
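The fallback property in the question above is the standard way to let a client on a Kerberized cluster talk to an insecure one. A sketch of the full command shape follows; the hostnames and paths are placeholders (the originals are truncated in the question), and the command is printed rather than executed. The usual pattern is to submit the job from the secure cluster.

```shell
# Hypothetical endpoints; run this from the Kerberized cluster so the
# secure side holds the credentials, and let its IPC client fall back
# to simple auth when talking to the insecure source.
INSECURE_SRC="hdfs://insecure-nn:8020/data"
SECURE_DST="hdfs://secure-nn:8020/data"

CMD="hadoop distcp -D ipc.client.fallback-to-simple-auth-allowed=true $INSECURE_SRC $SECURE_DST"
echo "$CMD"
```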
1
vote
1 answer

How to copy HDFS files from one cluster to another cluster by preserving the modification time

I have to move some HDFS files from my production cluster to a dev cluster. I have to test some operations on the HDFS files after moving them to the dev cluster, based on the file modification time. I need files with different dates to test it in dev. I tried doing…
Rob
  • 162
  • 3
  • 13
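For preserving modification times as asked above, DistCp's -p option takes attribute letters; in DistCp v2 the letter t preserves timestamps (alongside u for user, g for group, p for permissions, and others). A hedged sketch, with placeholder hostnames and the command printed rather than executed:

```shell
# Hypothetical endpoints; -pugpt preserves user, group, permissions,
# and timestamps on the copied files (letter set assumed from DistCp v2).
SRC="hdfs://prod-nn:8020/data"
DST="hdfs://dev-nn:8020/data"

CMD="hadoop distcp -pugpt $SRC $DST"
echo "$CMD"
```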
1
vote
1 answer

distcp: copy a file from HDFS to S3 (how to use it in Scala or Java)

I am trying to copy huge files from hdfs to s3 via distcp through the following code: val files:Array[String] = new Array[String](2) files(0) = "/****/in.zip" val in = new Path(new URI("/**/in.zip")) val out = new Path(new URI("***/out.zip")) var…
Obadah Meslmani
  • 339
  • 3
  • 15
1
vote
1 answer

Oozie - Setting strategy on DistCp through action configuration

I have a workflow with a distCp action, and it's running fairly well. However, now I'm trying to change the copy strategy and am unable to do that through the action arguments. The documentation is fairly slim on this topic and looking at the…
davdic
  • 249
  • 3
  • 10
1
vote
1 answer

Key file distribution in Hadoop cluster

I want to send a lot of files from HDFS to Google Storage (GS), so I want to use the distcp command in this case. hadoop distcp -libjars -m hdfs://:/ gs://
dmreshet
  • 1,496
  • 3
  • 18
  • 28
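The question above truncates its command, so here is a sketch of the general shape of a DistCp run that ships an extra jar with the job via the generic -libjars option; the jar path, bucket, and hostnames are placeholders I chose (writing to gs:// typically requires the GCS connector on the classpath), and the command is printed rather than executed.

```shell
# Hypothetical paths; -libjars distributes the listed jars with the
# MapReduce job, and -m caps the number of map tasks.
SRC="hdfs://nn:8020/export"
DST="gs://my-bucket/export"

CMD="hadoop distcp -libjars /opt/lib/gcs-connector.jar -m 10 $SRC $DST"
echo "$CMD"
```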
1
vote
0 answers

The nameservice for a Hadoop Namenode HA should be discoverable across clusters

Requirement: The nameservice for a Hadoop Namenode HA should be discoverable across clusters. Solution #1: One solution I found online is to add the nameservice configurations to all the hdfs-site.xml files in the clusters involved. Problem: We…
1
vote
0 answers

Hadoop distcp temporary folder

Does hadoop distcp create a temporary folder on HDFS when it copies from HDFS to Amazon S3a? Do we need an additional 1 TB of free space on HDFS when we want to copy 1 TB of data from HDFS into S3a? Thanks.
Bohdan Kolesnyk
  • 135
  • 2
  • 7
1
vote
0 answers

Optimal way to maintain two hadoop clusters

Can I get the advantages and disadvantages of transferring data from a database to two separate HDFS clusters at once, versus transferring to one HDFS cluster and then using distcp to move the data to the second cluster?
John Engelhart
  • 279
  • 4
  • 13