Questions tagged [distcp]

Hadoop tool used for large inter- and intra-cluster copying.

The distcp command is a tool used for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
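To make the "partition of the files" idea concrete, here is a minimal Python sketch of how a source listing could be divided among map tasks so that each task copies a roughly equal number of bytes. It is only an illustration under stated assumptions: the function name `partition_by_size`, the example paths, and the greedy largest-first strategy are hypothetical and are not DistCp's actual implementation.

```python
from typing import Dict, List


def partition_by_size(file_sizes: Dict[str, int], num_maps: int) -> List[List[str]]:
    """Greedily assign files to num_maps groups, balancing total bytes.

    Hypothetical helper for illustration only; DistCp's real splitting
    logic lives in its input formats and differs in detail.
    """
    if num_maps < 1:
        raise ValueError("num_maps must be at least 1")
    groups: List[List[str]] = [[] for _ in range(num_maps)]
    totals = [0] * num_maps
    # Place the largest files first, each into the currently lightest group.
    # Ties on size are broken by path so the result is deterministic.
    for path, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = totals.index(min(totals))
        groups[lightest].append(path)
        totals[lightest] += size
    return groups


if __name__ == "__main__":
    # Made-up listing: path -> size in bytes.
    listing = {"/a": 100, "/b": 60, "/c": 50, "/d": 10}
    for task_id, files in enumerate(partition_by_size(listing, 2)):
        print(task_id, files)
```

Each inner list stands for the set of files one map task would copy. In a real job the number of map tasks is capped with the `-m` option of `hadoop distcp`.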

181 questions
3
votes
1 answer

Set YARN application name for Hadoop Distcp job

NOTE: I don't want to specify a YARN-queue name as in "Hadoop: specify yarn queue for distcp". I frequently use hadoop distcp for moving data around HDFS and would like to have a descriptive application name for these jobs. Presently all copying jobs…
y2k-shubham
  • 10,183
  • 11
  • 55
  • 131
3
votes
1 answer

Assistance in reducing execution time of distcp operation

We have many distcp jobs copying data from our primary cluster to our backup cluster. These jobs run all day and copy almost all tables of critical databases. We use webhdfs here. Some of these jobs run for hours (for tables (ORC format ones) that…
Kumar
  • 119
  • 10
3
votes
2 answers

Hadoop distcp -possible to keep each file identical (retain file size)?

When I run a simple distcp command: hadoop distcp s3://src-bucket/src-dir s3://dest-bucket/dest-dir I get a slight discrepancy on the size (in bytes) of src-dir and dest-dir >aws s3 --summarize s3://dest-bucket/dest-dir/ ... Total Objects: 12290 …
pl0u
  • 365
  • 7
  • 16
3
votes
1 answer

Distcp - Container is running beyond physical memory limits

I've been struggling with distcp for several days and I swear I have googled enough. Here is my use-case: USE CASE I have a main folder in a certain location say /hdfs/root, with a lot of subdirs (deepness is not fixed) and files. Volume: 200,000…
GwydionFR
  • 787
  • 1
  • 10
  • 25
3
votes
2 answers

DistCp fault tolerance between two remote clusters

I need to copy a directory from one cluster to another with similar HDFS (both are MAPR clusters). I am planning to use the DistCp Java API. But I wanted to avoid duplicate copies of files in the directory. I wanted to know whether these operations are…
Mahdi
  • 787
  • 1
  • 8
  • 33
3
votes
4 answers

distcp from Hadoop to S3 fails with "No space available in any of the local directories"

I'm trying to copy data from a local hadoop cluster to an S3 bucket using distcp. Sometimes it "works", but some of the mappers fail, with the stack trace below. Other times, so many mappers fail that the whole job cancels. The error "No space…
Zack
  • 301
  • 1
  • 4
  • 9
3
votes
1 answer

Does distcp in hadoop ENCRYPT data while transporting from one cluster to another

I would like to know whether distcp has an option to encrypt data while transporting from one cluster to another. I got to know that it does support encryption in S3 cluster but that is something to do with Amazon's S3. What if we are moving plain text…
3
votes
2 answers

distcp hdfs to s3 fails

I was trying to do one directory which has hundreds of small files with extension .avro but it fails for some files with the following error: 14/09/18 13:05:19 INFO mapred.JobClient: map 99% reduce 0% 14/09/18 13:05:22 INFO mapred.JobClient: map 100%…
roy
  • 6,344
  • 24
  • 92
  • 174
3
votes
0 answers

hadoop distcp bandwidth issue

I am doing distcp from one hadoop cluster(version 0.20.2) to another hadoop cluster(version 2.2.0) using below command. hadoop distcp -update -skipcrccheck "hftp://x.x.x.x:50070//hive/warehouse//staging_eventlog_arpu_comma" …
user2950086
  • 135
  • 1
  • 1
  • 13
3
votes
0 answers

distcp s3 instance profile temporary credentials

I'm using distcp on my hadoop cluster in AWS. Now we are switching over to use IAM roles for the cluster nodes. A solution I was going to try was add in my own implementation of org.apache.hadoop.fs.s3native.NativeS3FileSystem that would be smarter…
Elan H
  • 33
  • 3
3
votes
1 answer

HDFS LeaseExpiredException

I have an application which is supposed to copy over a large number of files from a source such as S3 into HDFS. The application uses apache distcp within and copies each individual file from the source via streaming into HDFS. Each individual file…
user1084874
  • 141
  • 1
  • 1
  • 8
2
votes
0 answers

Error in accessing google cloud storage bucket via hadoop fs -ls that runs on Cloudera Hadoop CDH 6.3.3 integrated with Kerberos/SSL/LDAP cluster

I am getting the below error while accessing a Google Cloud Storage bucket for the first time via Cloudera CDH 6.3.3 Hadoop Cluster. I am running the command on the edge node where Google Cloud SDK is installed. Reachability of Google Storage is…
2
votes
1 answer

Hadoop distcp copy from on prem to gcp strange behavior

When I use the distcp command as hadoop distcp /a/b/c/d gs:/gcp-bucket/a/b/c/ , where d is a folder on HDFS containing subfolders. If folder c is already there on gcp then it copies d ( and its subfolders) from HDFS to gcp inside c but if c folder…
Vicky
  • 1,298
  • 1
  • 16
  • 33
2
votes
1 answer

move data from hdfs to s3 using session based token auth

Can someone please help me with authentication while moving the data from hdfs to S3. To connect to S3, I am generating session based credentials using aws_key_gen (access_key, secret_key, and session based token) I tested, distcp works fine with…
Manu Batham
  • 331
  • 1
  • 14
2
votes
4 answers

Hadoop Distcp aborting when copying data from one cluster to another

I am trying to copy data of a partitioned Hive table from one cluster to another. I am using distcp to copy the data but the underlying data is that of a partitioned Hive table. I used the following command. hadoop distcp -i {src} {tgt} But as the…
ismail basha
  • 21
  • 1
  • 4