6

I want to copy just a single file to HDFS using s3distcp. I have tried using the srcPattern argument, but it didn't help and the job keeps throwing a java.lang.RuntimeException. It is possible that the regex I am using is the culprit; please help.

My code is as follows:

elastic-mapreduce -j $jobflow --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://<mybucket>/<path>' --args '--dest,hdfs:///output' --arg --srcPattern --arg '(filename)'

Exception thrown:

Exception in thread "main" java.lang.RuntimeException: Error running job
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:586)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs:/tmp/a088f00d-a67e-4239-bb0d-32b3a6ef0105/files
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1036)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1028)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:172)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:944)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:897)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:871)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1308)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:568)
    ... 9 more
Amar
  • Whoever downvoted this, may I know the reason? – Amar Dec 14 '12 at 06:37
  • What if you have many 15 GB files at a given location in S3, but your job needs only one of them, and you want to get that one file into your local HDFS via s3distcp? – Amar Dec 14 '12 at 06:44

2 Answers

2

DistCp is intended to copy many files using many machines, so it is not the right tool if you only want to copy one file.

On the Hadoop master node, you can copy a single file with:

hadoop fs -cp s3://<mybucket>/<path> hdfs:///output
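To confirm the copy succeeded, you can then list the destination (hdfs:///output is the destination used above):

hadoop fs -ls hdfs:///output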

prestomation
  • Thanks. Though it might not be intended for this, you certainly can copy a single file using s3distcp. Consider the scenario of an automated pipeline run in which a cluster is launched and steps are added; in those scenarios s3distcp comes in handy. Now, say I have a SINGLE 20 GB gzip file, which would amount to a single mapper running for hours (around 10 hours in our case); using s3distcp with its '--outputCodec none' option not only copies the file to HDFS but also decompresses it, allowing Hadoop to create input splits and thus letting us use more than one mapper (time reduced to 2 hours). A sketch of such an invocation is shown after these comments. – Amar Jun 18 '13 at 17:52
  • I should add that s3distcp does not work when I try to copy a single file from S3. I *have to* specify a prefix and then a pattern to get the file I need. Not obvious from the documentation at all. – Tim Apr 01 '16 at 21:48
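
A minimal sketch of the invocation Amar describes (the bucket, path, and file name are placeholders, and the --srcPattern value is illustrative; --outputCodec is a documented s3distcp option that sets the compression codec for the copied files, with 'none' storing them uncompressed):

elastic-mapreduce -j $jobflow --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://<mybucket>/<path>' --args '--dest,hdfs:///output' --args '--srcPattern,.*myfile\.gz' --args '--outputCodec,none'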
1

The regex I was using was indeed the culprit. Say the file names contain dates, for example abcd-2013-06-12.gz; then in order to copy ONLY this file, the following EMR command should do it:

elastic-mapreduce -j $jobflow --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://<mybucket>/<path>/' --args '--dest,hdfs:///output' --arg --srcPattern --arg '.*2013-06-12.gz'

If I remember correctly, my regex initially was *2013-06-12.gz rather than .*2013-06-12.gz. The leading dot was needed: in a Java regex, * is a quantifier that must follow something it can repeat, so a pattern starting with a bare * is invalid, whereas .* matches any sequence of characters.
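
As an aside, the unescaped dot in 2013-06-12.gz matches any character in a Java regex, so .*2013-06-12\.gz would be a stricter pattern. One rough way to preview which keys a pattern will match before launching the job (grep's regex flavor differs slightly from Java's, so treat this only as a sanity check; the bucket and path are placeholders):

hadoop fs -ls s3://<mybucket>/<path>/ | grep '2013-06-12\.gz'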

Amar