Questions tagged [s3distcp]
60 questions
2
votes
1 answer
Step failed with exitCode, Amazon Emr Hadoop, S3DistCp
I'm trying to create a "Step" that gathers many small files into one, so I can split them out by day. The problem is that when I try to run it, it won't let me.
Running the command directly works fine for me:
hadoop distcp s3n://buket-name/output-files-hive/*…
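A minimal sketch of the kind of merge step being described, assuming a hypothetical bucket and grouping pattern (S3DistCp's --groupBy concatenates files whose paths match the regex, with the capture group becoming the merged file's name, and --targetSize caps each merged file in MiB):
s3-dist-cp --src s3://my-bucket/output-files-hive/ \
           --dest hdfs:///merged-by-day/ \
           --groupBy '.*(day-[0-9]+).*' \
           --targetSize 128
# bucket, destination and pattern are placeholders, not taken from the question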

David
- 119
- 1
- 12
1
vote
0 answers
Running distcp java job using hadoop yarn
I want to copy files present in HDFS to an S3 bucket using Java code. My Java code implementation looks like this:
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;
import org.apache.hadoop.tools.OptionsParser;
import…
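For reference, the equivalent copy driven from the command line (which the DistCp Java API wraps) could look like the sketch below; the s3a credential properties and bucket name are assumptions, not taken from the question:
hadoop distcp \
  -Dfs.s3a.access.key=AKIA... \
  -Dfs.s3a.secret.key=... \
  hdfs:///user/hadoop/source-dir \
  s3a://my-bucket/dest-prefix/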

Divya
- 31
- 1
- 4
1
vote
1 answer
-Dmapred.job.name does not work with s3-dist-cp command
I'd like to copy some files from EMR HDFS to an S3 bucket using s3-dist-cp. I've tried this command from the EMR master node:
s3-dist-cp -Dmapred.job.name=my_copy_job --src hdfs:///user/hadoop/abc s3://my_bucket/my_key/
This command executes fine, but when I…
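One thing worth checking (an assumption, not something confirmed by the excerpt): mapred.job.name is the deprecated Hadoop 1 property name; with plain hadoop distcp, which accepts -D generic options via ToolRunner, the current name is mapreduce.job.name:
hadoop distcp -Dmapreduce.job.name=my_copy_job hdfs:///user/hadoop/abc s3a://my_bucket/my_key/
# whether s3-dist-cp forwards the same -D property to its own MapReduce job is not settled here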

TheCodeCache
- 820
- 1
- 7
- 27
1
vote
1 answer
s3distcp fail with "mapreduce_shuffle does not exist"
When I run the command below,
s3-dist-cp --src s3://test/9.19 --dest hdfs:///user/hadoop/test
I got an error about auxService.
20/02/03 07:52:13 INFO mapreduce.Job: Task Id : attempt_1580716305878_0001_m_000000_2, Status : FAILED
Container launch…
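The "mapreduce_shuffle does not exist" message usually points at the NodeManagers missing the shuffle auxiliary service; a hedged check (config path assumed to be the EMR default) is that yarn-site.xml defines yarn.nodemanager.aux-services as mapreduce_shuffle and yarn.nodemanager.aux-services.mapreduce_shuffle.class as org.apache.hadoop.mapred.ShuffleHandler:
grep -A 2 'yarn.nodemanager.aux-services' /etc/hadoop/conf/yarn-site.xml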

Gon
- 11
- 2
1
vote
2 answers
Is it possible to specify the number of mappers-reducers while using s3-dist-cp?
I'm trying to copy data from an EMR cluster to S3 using s3-distcp. Can I set the number of reducers to a value greater than the default so as to speed up the process?
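A hedged sketch of the usual knobs: plain hadoop distcp takes -m to cap the number of map tasks, and the generic MapReduce property mapreduce.job.reduces can be passed as a -D option (whether s3-dist-cp honors it is an assumption, not confirmed here):
hadoop distcp -m 50 hdfs:///data/ s3a://my-bucket/data/
s3-dist-cp -Dmapreduce.job.reduces=20 --src hdfs:///data/ --dest s3://my-bucket/data/
# paths and bucket names are placeholders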

Kshitij Kohli
- 4,055
- 4
- 19
- 27
1
vote
1 answer
s3-dist-cp groupBy equivalent on Dataproc using hadoop distcp commands
On EMR, I am using s3-dist-cp --groupBy in order to rename a file with a random fileName inside a folder to a name of my choosing in S3:
s3-dist-cp --groupBy='.*(folder_in_hdfs).*' --src=hdfs:///user/testUser/tmp-location/folder_in_hdfs…
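Plain hadoop distcp has no --groupBy; one workaround sketch (an assumption about approach, with hypothetical bucket names, shown against GCS since the question is about Dataproc) is to merge the folder with hadoop fs -getmerge and then upload the single file:
hadoop fs -getmerge hdfs:///user/testUser/tmp-location/folder_in_hdfs merged_output
gsutil cp merged_output gs://my-bucket/my-prefix/merged_output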

dreddy
- 463
- 1
- 7
- 21
1
vote
1 answer
AWS file upload
I want to upload a few files into an AWS bucket from Hadoop. I have the
AWS ACCESS KEY, SECRET KEY and S3 IMPORT PATH.
I am not able to access them through the AWS CLI.
I set the keys in the aws credentials file.
I tried to do "aws s3 ls"
and I am getting an error:…
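A minimal sketch of the usual setup, assuming the (truncated) error is credential-related: run aws configure to store the keys, then retry the listing:
aws configure                      # prompts for access key, secret key, default region and output format
aws s3 ls s3://my-import-bucket/   # bucket name is a placeholder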

akr
- 43
- 7
1
vote
1 answer
Copying files from HDFS to S3 on EMR cluster using S3DistCp
I am copying 800 avro files, around 136 MB in size, from HDFS to S3 on an EMR cluster, but I'm getting this exception:
18/06/26 10:53:14 INFO mapreduce.Job: map 100% reduce 91%
18/06/26 10:53:14 INFO mapreduce.Job: Task Id :…

Waqar Ahmed
- 5,005
- 2
- 23
- 45
1
vote
1 answer
s3distcp copy from S3 to EMR HDFS data replica always on one node
I am using s3distcp to copy a 500GB dataset into my EMR cluster. It's a 12-node r4.4xlarge cluster, each node with a 750GB disk. It's using the EMR release label emr-5.13.0 and I'm adding Hadoop: Amazon 2.8.3, Ganglia: 3.7.2 and Spark 2.3.0. I'm using the…
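A hedged way to see where the copied blocks actually landed and to redistribute them afterwards (the dataset path is a placeholder, and this assumes the imbalance shows up at the HDFS block level):
hdfs fsck /user/hadoop/dataset -files -blocks -locations
hdfs balancer -threshold 10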

Gareth Rogers
- 13
- 3
1
vote
1 answer
Using GroupBy while copying from HDFS to S3 to merge files within a folder
I have the following folders in HDFS…
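A minimal sketch of the pattern the title describes, with hypothetical paths; the capture group in --groupBy becomes the name of the merged object in S3:
s3-dist-cp --src hdfs:///data/logs/ \
           --dest s3://my-bucket/merged/ \
           --groupBy '.*/(folder1)/part.*'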

Amistad
- 7,100
- 13
- 48
- 75
1
vote
1 answer
s3DistCp order of concatenation of files
I am trying to use the S3DistCp tool on AWS EMR to merge multiple files (1.txt, 2.txt, 3.txt) into a single gzip file. I am using the groupBy flag. For now, the output appears to be the concatenation of the source files in reverse order by name.
So the…
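A sketch of the kind of invocation being described, with hypothetical paths (the constant capture group puts all three files into one group, and --outputCodec=gzip gzips the merged output); nothing in this sketch controls the concatenation order the question observes:
s3-dist-cp --src s3://my-bucket/input/ \
           --dest s3://my-bucket/merged/ \
           --groupBy '.*/(input)/\d\.txt' \
           --outputCodec=gzip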

namitha gs
- 13
- 3
1
vote
1 answer
s3distcp copy files and directory from HDFS to S3 in a single command
I have the following 2 files and 1 directory in HDFS:
-rw-r--r-- 1 hadoop hadoop 11194859 2017-05-05 19:53 hdfs:///outputfiles/abc_output.txt
drwxr-xr-x - hadoop hadoop 0 2017-05-05 19:28 hdfs:///outputfiles/sample_directory
-rw-r--r-- 1…
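One hedged way to pick up both the file and the directory in one pass (bucket name is a placeholder) is to point --src at the parent directory and narrow it with --srcPattern:
s3-dist-cp --src hdfs:///outputfiles/ \
           --dest s3://my-bucket/outputfiles/ \
           --srcPattern '.*(abc_output\.txt|sample_directory).*'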

sashmi
- 97
- 1
- 2
- 14
1
vote
1 answer
Deduce the HDFS path at runtime on EMR
I have spawned an EMR cluster with an EMR step to copy a file from S3 to HDFS and vice-versa using s3-dist-cp.
This cluster is an on-demand cluster, so we are not keeping track of the IP.
The first EMR step is:
hadoop fs -mkdir /input - This step…
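If the goal is to avoid hard-coding the namenode address, a hedged sketch is to ask the cluster for it at runtime and build the HDFS paths from that:
hdfs getconf -confKey fs.defaultFS    # prints something like hdfs://ip-10-0-0-1.ec2.internal:8020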

sashmi
- 97
- 1
- 2
- 14
1
vote
1 answer
Adding S3DistCp to PySpark
I'm trying to add S3DistCp to my local, standalone Spark install. I've downloaded S3DistCp:
aws s3 cp s3://elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar .
And the AWS SDK as well:
wget…
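A hedged sketch of wiring the downloaded jar into a local PySpark session (whether S3DistCp actually runs correctly outside EMR is a separate question, not settled here):
pyspark --jars ./s3distcp.jar --driver-class-path ./s3distcp.jar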

Mark J Miller
- 4,751
- 5
- 44
- 74
1
vote
2 answers
How do I run "s3-dist-cp" command inside pyspark shell / pyspark script in EMR 5.x
I had some problems running an "s3-dist-cp" command in my pyspark script, as I needed to move some data from S3 to HDFS for a performance improvement, so here I am sharing this.

braj
- 2,545
- 2
- 29
- 40