
I have files like this in S3:

1-2013-08-22-22-something
2-2013-08-22-22-something
etc

Without srcPattern I can easily get all of the files from the bucket, but I want only the files with a specific prefix, for example all of the 1's. I've tried using srcPattern, but for some reason it isn't picking up any of the files.

My current command is:

elastic-mapreduce --jobflow $JOBFLOW --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3n://some-bucket/,--dest,hdfs:///hdfs-input,--srcPattern,[0-9]-.*' \
--step-name "copying over s3 files" 
Julian

1 Answer


It turns out you need .* in front of the regex.

For example, I needed:

.*[0-9]-.*

I'm guessing this is because the source pattern is matched against the whole path, which also includes the bucket name?
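
Here is a minimal sketch of why the leading .* would matter, assuming the pattern has to match the entire object URI rather than just the file name (a guess consistent with the behavior above, not something confirmed here). Python's re.fullmatch stands in for that matching; the selected helper and the .*/1-.* pattern are hypothetical and only for illustration.

```python
import re


def selected(uri: str, src_pattern: str) -> bool:
    """Return True if src_pattern matches the whole URI.

    Assumption: the copy tool requires a full match against the complete
    URI (scheme, bucket and key), not a substring match on the key.
    """
    return re.fullmatch(src_pattern, uri) is not None


uris = [
    "s3n://some-bucket/1-2013-08-22-22-something",
    "s3n://some-bucket/2-2013-08-22-22-something",
]

# Pattern from the question: fails because the URI starts with "s3n://...".
print([selected(u, r"[0-9]-.*") for u in uris])    # [False, False]

# Pattern from the answer: the leading .* absorbs the scheme and bucket.
print([selected(u, r".*[0-9]-.*") for u in uris])  # [True, True]

# Hypothetical pattern for only the 1's: anchor on the "/" before the key.
print([selected(u, r".*/1-.*") for u in uris])     # [True, False]
```

Note that .*[0-9]-.* selects every file whose path contains a digit followed by a dash, so under the same assumption something like .*/1-.* would be closer to "all of the 1's".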

Julian
  • This means we can control which files are included by using a regexp on the full path, which is a nice feature and should be clearly documented. – keiw Oct 30 '13 at 03:27
  • I had the same problem and solved it using something like: s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://source/ --srcPattern='.*events_2022_03/serial=0000.*|.*events_2022_04/serial=0000.*' --dest=hdfs:///user/spark/events/ (the pattern is quoted so the shell does not treat | as a pipe) – Paulo Moreira Aug 08 '22 at 20:32