
I have a Scalding job which runs on EMR. It runs on an S3 bucket containing several files. The source looks like this:

MultipleTextLineFiles("s3://path/to/input/").read
  /* ... some data processing ... */
  .write(Tsv("s3://paths/to/output/"))

I want to make it run on a nested bucket, i.e. a bucket containing subdirectories which themselves contain files. The job should process all the files in those inner directories. If I try to do this without altering the source, I get this error:

java.io.IOException: Not a file: s3://path/to/innerbucket

How can I alter this job to make it run on a nested bucket?

fblundun

1 Answer


Use a wildcard:

s3://path/to/input/*

If you have multiple levels of nesting, use multiple wildcards to get to the files:

s3://path/to/input/*/*/*
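Each `*` expands to one directory level, since glob wildcards do not cross `/` boundaries. As a sketch of those semantics, here is a small standalone demo using `java.nio`'s glob matcher, which (to my understanding) behaves like Hadoop's path globbing for `*`; the paths are illustrative:

```scala
import java.nio.file.{FileSystems, Paths}

object GlobDemo {
  // Each "*" matches exactly one path segment: it matches any
  // characters except the "/" separator.
  val matcher = FileSystems.getDefault.getPathMatcher("glob:path/to/input/*/*")

  def main(args: Array[String]): Unit = {
    // A file one directory below the input prefix matches...
    println(matcher.matches(Paths.get("path/to/input/inner/part-00000"))) // true
    // ...but a file directly under the prefix does not.
    println(matcher.matches(Paths.get("path/to/input/part-00000"))) // false
  }
}
```

So if your data sits at a mix of depths, you may need to run the job once per pattern or list the paths explicitly.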
Dan Osipov
  • I see that this also allows regex-style character choices, e.g. `s3://path/to/input/id-[35]` for files `id-3` and `id-5`. Do you have a link to the documentation for this wildcard syntax for reading S3 from EMR? Can other regular expression primitives like alternation (`this|that`) be used? – fblundun May 06 '16 at 10:16
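The character-class behavior the comment observes can be sketched with the same `java.nio` glob matcher (again assuming Hadoop's globbing follows standard glob rules here; paths are illustrative): `[35]` matches a single character that is either `3` or `5`, which is glob syntax rather than general regex, so alternation is written `{this,that}` rather than `this|that` in standard globs.

```scala
import java.nio.file.{FileSystems, Paths}

object CharClassDemo {
  // "[35]" is a glob character class: it matches exactly one
  // character, which must be '3' or '5'.
  val matcher = FileSystems.getDefault.getPathMatcher("glob:path/to/input/id-[35]")

  def main(args: Array[String]): Unit = {
    println(matcher.matches(Paths.get("path/to/input/id-3"))) // true
    println(matcher.matches(Paths.get("path/to/input/id-4"))) // false
  }
}
```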