Questions tagged [s3distcp]
60 questions
0
votes
1 answer
AWS-EMR-S3DISTCP - Does AWS charge for S3 actions?
I was looking into the documentation of s3distcp (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html) but I could not find any place that explicitly mentions the cost of each action. This is a sample scenario:
I have a…

Cesar A. Mostacero
- 720
- 6
- 12
0
votes
1 answer
S3DistCp (AWS-EMR) - deleteOnSuccess option creates file on source bucket
I'm working on an AWS-EMR cluster and added a step to run S3DISTCP (https://docs.aws.amazon.com/es_es/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html) in order to copy objects from an S3 bucket (the target/destination is also an S3 bucket).…

Cesar A. Mostacero
- 720
- 6
- 12
0
votes
1 answer
distcp is not executing
I am trying to copy data from one HDFS cluster to another using the distcp command. The following is the command I submitted:
hadoop distcp hdfs://sourcenamenodehostname:50070/var/lib/hadoop-hdfs/distcptest.txt…

Jibinjks
- 33
- 8
0
votes
1 answer
Scheduling output exporting from HDFS to S3
I am trying to figure out how to export data from HDFS that is output by an Apache Spark Streaming job. The following diagram shows the solution architecture:
Apache Spark runs a streaming job in an AWS EMR cluster and stores the result in HDFS. Streaming job…

Laurynas Stašys
- 328
- 4
- 16
0
votes
1 answer
Renaming and moving Spark output files in AWS takes a very, very long time
I have a Spark job that writes a huge output (300 GB) to S3.
My requirement is to rename all the part files and then move them to the final folder.
I did some research but could not find a solution where, within the Spark job itself, I can rename my Spark…

Atharv Thakur
- 671
- 3
- 21
- 39
0
votes
1 answer
S3distcp on local hadoop cluster not working
I am trying to run s3distcp from my local Hadoop pseudo-cluster. When I executed s3distcp.jar, I received the following stack trace. It seems that the reducer task is failing, but I am not able to pinpoint the reason that could be causing…

Chirag Goyal
- 1
- 2
0
votes
2 answers
Permission issue when using s3 dist cp to copy data from a non-EMR cluster to S3
To state my problem:
1) I want to back up our CDH Hadoop cluster to S3.
2) We have an EMR cluster running.
3) I am trying to run s3distcp from the EMR cluster, giving the HDFS URL of the remote CDH cluster as the source and S3 as the destination.
Having following…

Naveen
- 392
- 4
- 14
0
votes
2 answers
java.lang.IllegalArgumentException: Both source file listing and source paths present
I am trying to copy files from HDFS to S3 using distcp by executing the following command:
hadoop distcp -fs.s3a.access.key=AccessKey -fs.s3a.secret.key=SecrerKey \
s3n://testbdr/test2 hdfs://hostname:portnumber/tmp/test
But I am getting the…

Vinod Mhetre
- 3
- 3
0
votes
0 answers
Is there any aws-java-sdk to do s3-distcp?
I would like to run s3-distcp from my code.
My project is based on Java, and I'm using aws-java-sdk to launch clusters and submit Hadoop jobs to the cluster.
Since the output of the jobs needs to be copied to S3, I am looking for an SDK that…

appsdownload
- 751
- 7
- 20
0
votes
1 answer
Copy to s3 location using distcp command
I am copying some data from HDFS to S3 using the command below:
$ hadoop distcp -m 1 /user/hive/data/test/test_folder=2015_09_19_03_30 s3a://data/Test/buc/2015_09_19_03_30
The 2015_09_19_03_30 bucket does not exist in S3. It successfully copies the…

Mohit Rane
- 279
- 7
- 23
0
votes
1 answer
Hadoop distcp with file list
I would like to use distcp to copy a list of files (> 1K files) into HDFS. I have already stored the list of files in a local directory; can I now use -f to copy all the files? If yes, what format do I have to maintain in my file-list file? Or is…

Turbo Sullivan
- 827
- 1
- 7
- 9
0
votes
2 answers
s3-dist-cp and hadoop distcp jobs looping infinitely in EMR
I'm trying to copy 193 GB of data from S3 to HDFS. I'm running the following commands for s3-dist-cp and hadoop distcp:
s3-dist-cp --src s3a://PathToFile/file1 --dest hdfs:///user/hadoop/S3CopiedFiles/
hadoop distcp s3a://PathToFile/file1…

dreddy
- 463
- 1
- 7
- 21
0
votes
1 answer
Configure logging on AWS EMR for s3distcp
I would like to change s3distcp and other Hadoop commands to log only WARN messages or worse; currently they log INFO and worse.
How can I configure this on the head node of an AWS EMR cluster?
Here's an example of the output that I am trying…

mgoldwasser
- 14,558
- 15
- 79
- 103
0
votes
1 answer
Hadoop distcp to S3a with hidden key pair
How can I hide ACCESS_ID_KEY and SECRET_ACCESS_KEY for access to Amazon S3?
I know about adding them to core-site.xml, but maybe there are different solutions, because with this approach every user of the cluster will run distcp with the same keys. Maybe…

Bohdan Kolesnyk
- 135
- 2
- 7
-1
votes
1 answer
How to move HDFS files to S3 as ORC files using distcp?
I have a requirement to move text files in HDFS to AWS S3. The files in HDFS are non-partitioned text files. After migration, the files in S3 should be in ORC format and partitioned on a specific column. Finally, a Hive table is created on top of…

nagendra
- 1,885
- 3
- 17
- 27