
I have a Scalding job which runs on EMR. It runs on an S3 bucket containing several files. The source looks like this:

MultipleTextLineFiles("s3://path/to/input/").read
  /* ... some data processing ... */
  .write(Tsv("s3://paths/to/output/"))

I want to make it run on a nested bucket, i.e. a bucket containing subdirectories which themselves contain files. The job should process all the files in those inner directories. If I try to do this without altering the source, I get this error:

java.io.IOException: Not a file: s3://path/to/innerbucket

How can I alter this job to make it run on a nested bucket?

fblundun

1 Answer


Use a wildcard:

s3://path/to/input/*

If you have multiple levels of nesting, use multiple wildcards to get to the files:

s3://path/to/input/*/*/*
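Each `*` expands to one directory level, since glob wildcards do not cross `/` boundaries. As a sketch of those semantics, here is a small standalone demo using `java.nio`'s glob matcher, which (to my understanding) behaves like Hadoop's path globbing for `*`; the paths are illustrative:

```scala
import java.nio.file.{FileSystems, Paths}

object GlobDemo {
  // Each "*" matches exactly one path segment: it matches any
  // characters except the "/" separator.
  val matcher = FileSystems.getDefault.getPathMatcher("glob:path/to/input/*/*")

  def main(args: Array[String]): Unit = {
    // A file one directory below the input prefix matches...
    println(matcher.matches(Paths.get("path/to/input/inner/part-00000"))) // true
    // ...but a file directly under the prefix does not.
    println(matcher.matches(Paths.get("path/to/input/part-00000"))) // false
  }
}
```

So if your data sits at a mix of depths, you may need to run the job once per pattern or list the paths explicitly.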
Dan Osipov
  • I see that this also allows regex-style character choices, e.g. `s3://path/to/input/id-[35]` for files `id-3` and `id-5`. Do you have a link to the documentation for this wildcard syntax for reading S3 from EMR? Can other regular expression primitives like alternation (`this|that`) be used? – fblundun May 06 '16 at 10:16
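The character-class behavior the comment observes can be sketched with the same `java.nio` glob matcher (again assuming Hadoop's globbing follows standard glob rules here; paths are illustrative): `[35]` matches a single character that is either `3` or `5`, which is glob syntax rather than general regex, so alternation is written `{this,that}` rather than `this|that` in standard globs.

```scala
import java.nio.file.{FileSystems, Paths}

object CharClassDemo {
  // "[35]" is a glob character class: it matches exactly one
  // character, which must be '3' or '5'.
  val matcher = FileSystems.getDefault.getPathMatcher("glob:path/to/input/id-[35]")

  def main(args: Array[String]): Unit = {
    println(matcher.matches(Paths.get("path/to/input/id-3"))) // true
    println(matcher.matches(Paths.get("path/to/input/id-4"))) // false
  }
}
```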