Questions tagged [distcp]

Hadoop tool used for large inter- and intra-cluster copying.

The distcp command is a tool used for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
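A typical invocation looks like the following (cluster addresses and paths are illustrative):

```shell
# Copy a directory tree between clusters; DistCp launches a MapReduce job
# whose map tasks each copy a partition of the expanded file list.
hadoop distcp hdfs://nn1:8020/source/path hdfs://nn2:8020/dest/path

# -update copies only files that are missing or differ at the destination.
hadoop distcp -update hdfs://nn1:8020/source/path hdfs://nn2:8020/dest/path
```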

181 questions
1
vote
0 answers

"Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4" when copying data from HDFS to S3

I am trying to use distcp to copy data from HDFS to S3, but I got an error: Caused by: org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 Error Message. -- ResponseCode: 400, ResponseStatus: Bad Request, XML Error…
michelle
  • 197
  • 2
  • 14
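The jets3t-based s3:// and s3n:// connectors sign requests with Signature Version 2 only, while SSE-KMS buckets require Version 4, so the usual fix is to switch to the s3a connector and pin the bucket's region endpoint. A hedged sketch (bucket name, region, property names assume a Hadoop build with s3a, and the key values are placeholders):

```shell
# Use s3a instead of s3/s3n; s3a supports Signature V4 and SSE-KMS.
hadoop distcp \
  -Dfs.s3a.access.key=... \
  -Dfs.s3a.secret.key=... \
  -Dfs.s3a.endpoint=s3.eu-central-1.amazonaws.com \
  -Dfs.s3a.server-side-encryption-algorithm=SSE-KMS \
  hdfs:///data/src s3a://my-bucket/dest/
```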
1
vote
0 answers

Hadoop distcp to S3 performance is very slow

I am trying to copy data from HDFS to Amazon S3 using hadoop distcp. The amount of data is 227 GB and the job has been running for more than 12 hours. Is there a hard limit of 3,500 write requests for an S3 bucket? And could this be causing the…
Hemanth
  • 705
  • 2
  • 16
  • 32
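S3 throttles at roughly 3,500 PUT/COPY/POST/DELETE requests per second per prefix, which a 227 GB copy of reasonably sized files is unlikely to hit; slow DistCp runs are more often limited by the number of map tasks. A tuning sketch (paths and values are placeholders; the fs.s3a.* properties assume the s3a connector):

```shell
# -m raises the number of parallel copy maps; the s3a options widen the
# S3 connection pool and enable buffered multipart uploads.
hadoop distcp \
  -Dfs.s3a.connection.maximum=100 \
  -Dfs.s3a.fast.upload=true \
  -m 64 \
  hdfs:///data/src s3a://my-bucket/dest/
```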
1
vote
2 answers

Moving data from hive views to aws s3

Hi, is there any way we could move data from Hive views to S3? For tables I am using distcp, but since views don't have data residing in an HDFS location I wasn't able to use distcp, and I don't have access to the tables used in creating the views. If I do CTAS…
Rajkumar
  • 189
  • 5
  • 19
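Since a view has no backing files, one option is to materialize it with CTAS and then distcp the resulting table directory. A sketch under the assumption that the view itself is readable (the database, view, JDBC URL, and paths below are all hypothetical):

```shell
# Materialize the view into a throwaway table, then copy its HDFS
# directory to S3.
beeline -u jdbc:hive2://localhost:10000 -e "
  CREATE TABLE staging.my_view_snapshot
  STORED AS PARQUET
  AS SELECT * FROM mydb.my_view;
"
hadoop distcp /warehouse/staging.db/my_view_snapshot s3a://my-bucket/exports/my_view/
```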
1
vote
1 answer

Data Copy between ADLS Instances

Copying data between various instances of ADLS using DistCp. Hi all, hope you are doing well. We have a use case around using ADLS as different tiers of the ingestion process, and just require your valuable opinions regarding the feasibility of the…
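DistCp can copy directly between ADLS accounts as long as the cluster holds credentials for both. A sketch for ADLS Gen2 (account names, containers, and paths are placeholders; credentials are assumed to be configured via the usual fs.azure.account.* properties in core-site.xml):

```shell
# Copy between two ADLS Gen2 accounts over the abfss:// scheme.
hadoop distcp \
  abfss://raw@accountone.dfs.core.windows.net/ingest/2023/ \
  abfss://curated@accounttwo.dfs.core.windows.net/ingest/2023/
```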
1
vote
3 answers

Copy files from a hdfs folder to another hdfs location by filtering with modified date using shell script

I have 1 year of data in my HDFS location and I want to copy the data for the last 6 months into another HDFS location. Is it possible to copy only 6 months of data directly with an HDFS command, or do we need to write a shell script for copying data for the last 6…
Antony
  • 970
  • 3
  • 20
  • 46
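Neither distcp nor `hdfs dfs -cp` filters by modification date, so a small shell wrapper over `hdfs dfs -ls` is the usual approach. A sketch, assuming GNU date and the default `hdfs dfs -ls` output format (column 6 is the modification date, column 8 the path; the paths in the usage comment are hypothetical):

```shell
# Cutoff date six months back, in the same YYYY-MM-DD format ls prints.
CUTOFF=$(date -d "6 months ago" +%Y-%m-%d)

# Read `hdfs dfs -ls` lines on stdin; print paths modified on/after $CUTOFF.
filter_recent() {
  awk -v cutoff="$CUTOFF" '$6 >= cutoff { print $8 }'
}

# Usage sketch: copy each surviving path to the other HDFS location.
# hdfs dfs -ls /data/source | filter_recent | while read -r f; do
#   hdfs dfs -cp "$f" /data/dest/
# done
```

String comparison works here because ISO dates sort lexicographically.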
1
vote
2 answers

Copy Files from NFS or Local FS to HDFS

I am trying to copy a large number of files (100k+, total size 2 TB) from NFS to HDFS. What is an efficient way to do it? I have tried the options below after mounting it on the edge node. hdfs dfs -put: it fails with a memory error and the transfer is also…
Arghya Saha
  • 227
  • 1
  • 4
  • 17
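`hdfs dfs -put` streams everything through a single client process, which is slow for 100k+ files. DistCp parallelizes the copy, but with the caveat that a `file://` source must be mounted at the same path on every worker node that runs a map task, not just on the edge node. A sketch (mount point and destination are placeholders):

```shell
# Parallel copy from an NFS mount visible on all workers into HDFS.
hadoop distcp -m 32 file:///mnt/nfs/data hdfs:///data/landing/
```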
1
vote
2 answers

Is it possible to specify the number of mappers-reducers while using s3-dist-cp?

I'm trying to copy data from an EMR cluster to S3 using s3-dist-cp. Can I set the number of reducers to a value greater than the default so as to speed up the process?
Kshitij Kohli
  • 4,055
  • 4
  • 19
  • 27
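s3-dist-cp runs as an ordinary Hadoop MapReduce job, so generic `-D` properties are usually accepted ahead of the tool's own arguments; whether the reducer count helps depends on whether a --groupBy aggregation phase is in play. A hedged sketch (paths and the value 50 are placeholders):

```shell
# Request more reducers for the s3-dist-cp job.
s3-dist-cp -Dmapreduce.job.reduces=50 \
  --src=hdfs:///data/output \
  --dest=s3://my-bucket/data/output
```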
1
vote
1 answer

hadoop distributed copy overwrite not working

I am trying to use the org.apache.hadoop.tools.DistCp class to copy some files over into an S3 bucket. However, the overwrite functionality is not working in spite of explicitly setting the overwrite flag to true. Copying works fine, but it does not…
kevroger23
  • 31
  • 3
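One common cause is that the overwrite flag never makes it into the options object that is actually executed, or that it is combined with the update flag, which changes its semantics. A minimal sketch against the Hadoop 2.8+ `DistCpOptions.Builder` API (older releases use setters on `DistCpOptions` instead; the source path and bucket are placeholders):

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class OverwriteCopy {
  public static void main(String[] args) throws Exception {
    DistCpOptions options = new DistCpOptions.Builder(
            Collections.singletonList(new Path("hdfs:///data/src")),
            new Path("s3a://my-bucket/dest/"))
        .withOverwrite(true)   // equivalent of the -overwrite CLI flag
        .build();
    new DistCp(new Configuration(), options).execute();
  }
}
```

Note that `-overwrite` also changes path semantics: the *contents* of the source directory are copied under the destination rather than the directory itself.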
1
vote
1 answer

s3-dist-cp groupBy equivalent on Dataproc using hadoop distcp commands

On EMR, I am using s3-dist-cp --groupBy in order to rename files with random names in a folder to a name I wish to use in S3: s3-dist-cp --groupBy='.*(folder_in_hdfs).*' --src=hdfs:///user/testUser/tmp-location/folder_in_hdfs…
dreddy
  • 463
  • 1
  • 7
  • 21
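Plain `hadoop distcp` has no --groupBy equivalent, so on Dataproc a common workaround is to merge the part files first and upload the single result to GCS under the desired name. A sketch (paths, bucket, and the output name are placeholders):

```shell
# Merge the HDFS part files into one local file, then upload it with the
# target name.
hdfs dfs -getmerge /user/testUser/tmp-location/folder_in_hdfs merged_output
gsutil cp merged_output gs://my-bucket/folder/desired_name
```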
1
vote
0 answers

Distcp to S3 OK but cant list

Basically I'm using the `distcp` command to put some data in S3, pretty small files. I am sending the data into a bucket, and I put the files inside a folder of the bucket. It works fine, and I can see in the logs that something has been written to…
DDDDEEEEXXXX
  • 97
  • 1
  • 6
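If the copy used the legacy s3:// "block" filesystem, the data lands in a proprietary block layout that does not show up as ordinary keys, which would explain a successful copy that cannot be listed; the s3a:// scheme writes normal objects. To check what actually landed (bucket and prefix are placeholders):

```shell
# Recursively list every key under the prefix, regardless of "folder"
# structure.
aws s3 ls s3://my-bucket/my-folder/ --recursive
```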
1
vote
1 answer

AWS file upload

I want to upload a few files into an AWS bucket from Hadoop. I have the AWS access key, secret key, and S3 import path, but I am not able to access the bucket through the AWS CLI. I set the keys in the AWS credentials file. When I try "aws s3 ls" I get an error:…
akr
  • 43
  • 7
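A quick way to rule out a malformed credentials file is to supply the credentials through environment variables, which the AWS CLI reads ahead of the file. A sketch (key values, region, and bucket are placeholders):

```shell
# Environment variables override ~/.aws/credentials for this shell.
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1
aws s3 ls s3://my-bucket/import/path/
```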
1
vote
1 answer

distcp causing skewness in HDFS

I have a folder(around 2 TB in size) in HDFS, which was created using save method from Apache Spark. It is almost evenly distributed across nodes (I checked this using hdfs fsck). When I try to distcp this folder (intra-cluster), and run hdfs fsck…
pri
  • 1,521
  • 2
  • 13
  • 26
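HDFS places the first replica of each block on the datanode local to the writer, so when the distcp map tasks happen to run on a few nodes, the copied data concentrates there. One hedged remedy is to rebalance after the copy (the threshold is the allowed utilization spread in percent):

```shell
# Redistribute blocks until datanode utilization is within 5% of the mean.
hdfs balancer -threshold 5
```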
1
vote
1 answer

ACLs not supported on at least one file system: Distcp HDFS

As per the distcp documentation: "If -pa is specified, DistCp preserves the permissions also because ACLs are a super-set of permissions." But hadoop distcp -pa -delete -update /src/path /dest/path/ is failing with ACLs not supported on at least one…
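This error usually means ACL support is disabled on one side's NameNode; the fix is either to drop the -pa flag or to enable ACLs on the cluster that rejects them (hdfs-site.xml, NameNode restart required):

```xml
<!-- hdfs-site.xml on the cluster reporting "ACLs not supported" -->
<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
</property>
```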
1
vote
0 answers

What is the fastest way to move data from one volume to another with MapR?

I want to move data from one volume to another. The folders and file sizes vary. Files can be up to 100 GB, but we can have also a lot of small files. If there is data in the destination volume at that particular folder, it can be overwritten. So…
Stefan Papp
  • 2,199
  • 1
  • 28
  • 54
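DistCp also works against MapR-FS paths, and with -update plus -delete the destination becomes a mirror of the source, overwriting files that differ. A sketch (volume paths are placeholders); for recurring moves, MapR's built-in volume mirroring may be faster than a MapReduce copy:

```shell
# Make the destination volume path mirror the source, overwriting and
# deleting as needed.
hadoop distcp -update -delete maprfs:///src-volume/data maprfs:///dest-volume/data
```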
1
vote
1 answer

HDFS Connector for Object Storage: Does not contain a valid host:port authority

I configured the HDFS Connector for Object storage as described here: https://docs.us-phoenix-1.oraclecloud.com/Content/API/SDKDocs/hdfsconnector.htm#troubleshooting When I am running distcp with the following command: hadoop distcp -libjars…
Matthias
  • 11
  • 1
  • 2