Questions tagged [s3distcp]

60 questions
2
votes
1 answer

Step failed with exitCode, Amazon EMR Hadoop, S3DistCp

I'm trying to create a "Step" that gathers many small files into one, so I can separate them by day. The problem is that when I try to run it, it does not let me. Running this command works well for me: hadoop distcp s3n://buket-name/output-files-hive/*…
David
  • 119
  • 1
  • 12
1
vote
0 answers

Running distcp java job using hadoop yarn

I want to copy files from HDFS to an S3 bucket using Java code. My Java implementation looks like this: import org.apache.hadoop.tools.DistCp; import org.apache.hadoop.tools.DistCpOptions; import org.apache.hadoop.tools.OptionsParser; import…
Divya
  • 31
  • 1
  • 4
1
vote
1 answer

-Dmapred.job.name does not work with s3-dist-cp command

I'd like to copy some files from EMR HDFS to an S3 bucket using s3-dist-cp. I've tried this command from the EMR master node: s3-dist-cp -Dmapred.job.name=my_copy_job --src hdfs:///user/hadoop/abc s3://my_bucket/my_key/ This command executes fine, but when I…
TheCodeCache
  • 820
  • 1
  • 7
  • 27
1
vote
1 answer

s3distcp fails with "mapreduce_shuffle does not exist"

When I run the command below, s3-dist-cp --src s3://test/9.19 --dest hdfs:///user/hadoop/test I get an error about auxService: 20/02/03 07:52:13 INFO mapreduce.Job: Task Id : attempt_1580716305878_0001_m_000000_2, Status : FAILED Container launch…
Gon
  • 11
  • 2
1
vote
2 answers

Is it possible to specify the number of mappers-reducers while using s3-dist-cp?

I'm trying to copy data from an EMR cluster to S3 using s3-dist-cp. Can I set the number of reducers to a value greater than the default to speed up the process?
Kshitij Kohli
  • 4,055
  • 4
  • 19
  • 27
1
vote
1 answer

s3-dist-cp groupBy equivalent on Dataproc using hadoop distcp commands

On EMR, I am using s3-dist-cp --groupBy to rename a file that has a random name in a folder to a name of my choosing in S3: s3-dist-cp --groupBy='.*(folder_in_hdfs).*' --src=hdfs:///user/testUser/tmp-location/folder_in_hdfs…
dreddy
  • 463
  • 1
  • 7
  • 21
1
vote
1 answer

AWS file upload

I want to upload a few files to an AWS bucket from Hadoop. I have the AWS access key, secret key, and S3 import path, but I am not able to access the bucket through the AWS CLI. I set the keys in the AWS credentials file. When I run “aws s3 ls”, I get an error as…
akr
  • 43
  • 7
1
vote
1 answer

Copying files from HDFS to S3 on EMR cluster using S3DistCp

I am copying 800 Avro files, size around 136 MB, from HDFS to S3 on an EMR cluster, but I'm getting this exception: 18/06/26 10:53:14 INFO mapreduce.Job: map 100% reduce 91% 18/06/26 10:53:14 INFO mapreduce.Job: Task Id :…
Waqar Ahmed
  • 5,005
  • 2
  • 23
  • 45
1
vote
1 answer

s3distcp copy from S3 to EMR HDFS data replica always on one node

I am using s3distcp to copy a 500GB dataset into my EMR cluster. It's a 12-node r4.4xlarge cluster, each node with a 750GB disk. It's using the EMR release label emr-5.13.0 and I'm adding Hadoop: Amazon 2.8.3, Ganglia: 3.7.2 and Spark 2.3.0. I'm using the…
1
vote
1 answer

Using GroupBy while copying from HDFS to S3 to merge files within a folder

I have the following folders in HDFS…
Amistad
  • 7,100
  • 13
  • 48
  • 75
1
vote
1 answer

s3DistCp order of concatenation of files

I am trying to use the S3DistCp tool on AWS EMR to merge multiple files (1.txt, 2.txt, 3.txt) into a single gzip file. I am using the groupBy flag. For now, the output seems to be the concatenation of the source files in reverse order by name. So the…
namitha gs
  • 13
  • 3
1
vote
1 answer

s3distcp copy files and directory from HDFS to S3 in a single command

I have the following 2 files and 1 directory in HDFS: -rw-r--r-- 1 hadoop hadoop 11194859 2017-05-05 19:53 hdfs:///outputfiles/abc_output.txt drwxr-xr-x - hadoop hadoop 0 2017-05-05 19:28 hdfs:///outputfiles/sample_directory -rw-r--r-- 1…
sashmi
  • 97
  • 1
  • 2
  • 14
1
vote
1 answer

Deduce the HDFS path at runtime on EMR

I have spawned an EMR cluster with an EMR step to copy a file from S3 to HDFS and vice versa using s3-dist-cp. This cluster is an on-demand cluster, so we are not keeping track of the IP. The first EMR step is: hadoop fs -mkdir /input - This step…
sashmi
  • 97
  • 1
  • 2
  • 14
1
vote
1 answer

Adding S3DistCp to PySpark

I'm trying to add S3DistCp to my local, standalone Spark install. I've downloaded S3DistCp: aws s3 cp s3://elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar . And the AWS SDK as well: wget…
Mark J Miller
  • 4,751
  • 5
  • 44
  • 74
1
vote
2 answers

How do I run "s3-dist-cp" command inside pyspark shell / pyspark script in EMR 5.x

I had some problems running an "s3-dist-cp" command in my PySpark script, as I needed to move some data from S3 to HDFS to improve performance, so I am sharing this here.
braj
  • 2,545
  • 2
  • 29
  • 40