Questions tagged [distcp]

Hadoop tool used for large inter- and intra-cluster copying.

The distcp command is a tool used for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
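For context, a minimal invocation looks like the following (host names and paths are placeholders):

```shell
# Copy a directory tree between clusters; distcp launches a MapReduce
# job in which each map task copies a partition of the source files.
hadoop distcp hdfs://nn1:8020/source/dir hdfs://nn2:8020/dest/dir

# -update copies only files that are missing or differ at the destination.
hadoop distcp -update hdfs://nn1:8020/source/dir hdfs://nn2:8020/dest/dir
```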

181 questions
2 votes, 1 answer

hadoop distcp does not create folder when we pass a single file

I am facing the below issue in Hadoop distcp; any suggestion or help is highly appreciated. I am trying to copy data from Google Cloud Platform to Amazon S3. 1) When we have multiple files to copy from source to destination (this works fine) val…
Bhavesh • 909 • 2 • 23 • 38

2 votes, 1 answer

Is it possible to copy specific files (comma separated) using distcp between two HDFS directories (separate clusters)?

I need to distcp only x number of files but couldn't find a way to do it. One idea is to copy them to a temporary directory and then distcp that directory; once complete, I can delete the temp directory. Individual distcp commands (for each…
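One way to avoid the temporary-directory workaround is distcp's `-f` option, which reads the source list from a file instead of expanding a directory. A sketch, with placeholder cluster names and paths:

```shell
# Build a source list, one URI per line, and put it on HDFS.
cat > srclist.txt <<'EOF'
hdfs://cluster1:8020/data/part-00001
hdfs://cluster1:8020/data/part-00007
EOF
hdfs dfs -put srclist.txt /tmp/srclist.txt

# -f tells distcp to use the file at the given URI as its source list,
# so only the listed files are copied to the destination cluster.
hadoop distcp -f hdfs://cluster1:8020/tmp/srclist.txt hdfs://cluster2:8020/data/
```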
2 votes, 0 answers

How to calculate how fast a data transfer in Hadoop is when using distcp

I'm using distcp to move files between two hadoop clusters. How can I check the rate at which the data is moving between any two clusters?
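One approach: the distcp job's counters report the number of bytes copied, and the job history gives the elapsed time; dividing the two yields the throughput. A rough sketch, where the byte and time values are made-up examples:

```shell
# Throughput = bytes copied / elapsed seconds. In practice, take the
# "Bytes Copied" value from the distcp job counters and the elapsed
# time from the job history; the numbers below are example values.
bytes_copied=1073741824   # 1 GiB, example value
elapsed_secs=64           # example value
echo "$(( bytes_copied / elapsed_secs / 1048576 )) MiB/s"
```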
sui • 353 • 1 • 3 • 7

2 votes, 1 answer

Does Hadoop Distcp copy at block level?

Distcp jobs between/within clusters are MapReduce jobs. My assumption was that it copies files at the input-split level, helping copy performance since a file would be copied by multiple mappers working on multiple "pieces" in parallel. However, when I…
Tianqi Tong • 83 • 1 • 7

2 votes, 3 answers

Distcp Mismatch in length of source

I am facing an issue while executing a distcp command between two different Hadoop clusters: Caused by: java.io.IOException: Mismatch in length of source:hdfs://ip1/xxxxxxxxxx/xxxxx and …
Aditya • 21 • 4

2 votes, 0 answers

Download large volumes from S3 to a local machine? - s3distcp

Currently distcp is slow, taking up to 4:16 minutes to copy one hour's worth of logs, while a custom function written by me takes only 16 seconds. Given that Amazon provides s3distcp examples involving logs, I thought I'd give this a go and test…
ylun.ca • 2,504 • 7 • 26 • 47

2 votes, 1 answer

Securely transferring data from HDFS to amazon S3 using distcp

We want to back up the HDFS data in our Cloudera Hadoop cluster to Amazon S3. It looks like we can use distcp for this, but what is not clear is whether the data is copied to S3 over an encrypted transport. Is there something that needs to be configured to…
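With the s3a connector, transport encryption is governed by the `fs.s3a.connection.ssl.enabled` property, which defaults to true, so transfers go over HTTPS unless it has been explicitly disabled. A sketch with a placeholder bucket name:

```shell
# fs.s3a.connection.ssl.enabled defaults to true; setting it on the
# command line makes the intent explicit. Bucket name is a placeholder.
hadoop distcp \
  -D fs.s3a.connection.ssl.enabled=true \
  hdfs://namenode:8020/data \
  s3a://my-backup-bucket/data/
```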
Marco Di Cesare • 133 • 2 • 7

2 votes, 0 answers

Is there a way to pull an entire directory through WebHDFS in Hadoop?

We have two clusters, and our requirement is to pull data from one cluster to another. The only option available to us is to pull the data through WebHDFS! But unfortunately, what we can see is that through WebHDFS we can only pull one file at a time, that…
Raja • 513 • 5 • 18

1 vote, 0 answers

Hadoop distcp does not skip CRC checks

I have an issue with skipping CRC checks between source and target paths when running distcp. I copy and decrypt files on demand, and their checksums differ, which is expected. My command looks like the following: hadoop distcp -skipcrccheck -update…
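Two details worth knowing here: `-skipcrccheck` is only valid together with `-update`, and even with it, distcp still compares file lengths, so files whose size changes (for example, after decryption) can still fail or be re-copied. A sketch with placeholder paths:

```shell
# -skipcrccheck skips the source/target checksum comparison, but it is
# valid only in combination with -update, and distcp still compares
# file lengths when deciding whether files match.
hadoop distcp -update -skipcrccheck hdfs://src-nn:8020/path hdfs://dst-nn:8020/path
```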
1 vote, 0 answers

Unable to copy HDFS data to S3 bucket

I have an issue related to a similar question asked before. I'm unable to copy data from HDFS to an S3 bucket in IBM Cloud. I use the command: hadoop distcp hdfs://namenode:9000/user/root/data/ s3a://hdfs-backup/ I've added extra properties in…
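For a non-AWS object store such as IBM Cloud Object Storage, the s3a connector needs credentials and an explicit endpoint. A sketch, where the key values and endpoint are placeholders:

```shell
# Credentials and endpoint can also be set in core-site.xml instead of
# on the command line. All values here are placeholders.
hadoop distcp \
  -D fs.s3a.access.key=MY_ACCESS_KEY \
  -D fs.s3a.secret.key=MY_SECRET_KEY \
  -D fs.s3a.endpoint=s3.us-south.cloud-object-storage.appdomain.cloud \
  hdfs://namenode:9000/user/root/data/ \
  s3a://hdfs-backup/
```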
dddroog • 33 • 7

1 vote, 1 answer

hdfs distcp failing to copy from hdfs to s3

We have a Snowball configured on our in-house staging node with endpoint http://10.91.16.213:8080. It all works properly; I can even list files in this Snowball via the S3 CLI command aws s3 ls my-bucket/data/…
Anum Sheraz • 2,383 • 1 • 29 • 54

1 vote, 0 answers

Running a distcp Java job using Hadoop YARN

I want to copy files present in HDFS to an S3 bucket using Java code. My implementation looks like this: import org.apache.hadoop.tools.DistCp; import org.apache.hadoop.tools.DistCpOptions; import org.apache.hadoop.tools.OptionsParser; import…
Divya • 31 • 1 • 4

1 vote, 1 answer

-Dmapred.job.name does not work with s3-dist-cp command

I'd like to copy some files from EMR HDFS to an S3 bucket using s3-dist-cp. I've tried this command from the EMR master node: s3-dist-cp -Dmapred.job.name=my_copy_job --src hdfs:///user/hadoop/abc s3://my_bucket/my_key/ This command executes fine, but when I…
TheCodeCache • 820 • 1 • 7 • 27

1 vote, 1 answer

How to change hadoop distcp staging directory

When I ran the command hadoop distcp -update hdfs://path/to/a/file.txt hdfs://path/to/b/ I got a Java IOException: java.io.IOException: Mkdirs failed to create /some/.staging/directory However, I don't want to use "/some/file/path" as a temporary…
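On MRv2, the `.staging` directory location comes from the `yarn.app.mapreduce.am.staging-dir` property, so overriding it per job is one possible fix. A sketch, where the staging path is a placeholder:

```shell
# Override the MapReduce staging directory for this distcp job only;
# the job submitter must have write access to the chosen path.
hadoop distcp \
  -D yarn.app.mapreduce.am.staging-dir=/user/myuser/staging \
  -update hdfs://path/to/a/file.txt hdfs://path/to/b/
```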
konchy • 573 • 5 • 16

1 vote, 0 answers

distcp copy of transactional Hive tables

I have a database in Hive that I want to copy to a new database with encryption. The tables are transactional. I used distcp to copy from the first DB to the new encrypted one: hadoop distcp -skipcrccheck -update /warehouse/tablespace/managed/hive/old_dbs.db…