How to specify wildcards in a filename for an Amazon EMR job

Question

If I run an EMR job and specify wildcards in the directory path, it all works fine, e.g. s3n://mybucket/*/*/*/fileName.gz picks up all files named fileName.gz under subdirectories of mybucket.

However, when I specify wildcards in the file name, the EMR logs show an error that no match was found. It seems to treat the '*' character as a literal part of the file name instead of as a wildcard, e.g. s3n://mybucket/Dir1/fileName.*.gz gives back an error that no matches were found for fileName.*.gz in that directory.

How do we specify wildcards in the file name for an Amazon EMR job?

I was able to get it working by specifying a character-class pattern for the filename, like s3n://mydir/fileName-00.[0-9][0-9][0-9].gz. This matches filenames like fileName-00.123.gz and fileName-00.432.gz — user2330278, Jan 27 '14 at 18:25
What tool do you use? For example, Hive, Pig, or something else. — hiropon, Apr 05 '17 at 05:06
Possible duplicate of [when creating an external table in hive can I point the location to specific files in a direcotry?](http://stackoverflow.com/questions/11269203/when-creating-an-external-table-in-hive-can-i-point-the-location-to-specific-fil) — hiropon, Apr 20 '17 at 00:29
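
The character-class pattern from the first comment can be tried locally before submitting a job. This is only a sketch: it uses the shell's own `case` pattern matching as a stand-in, on the assumption that `*`, `?` and `[0-9]` behave similarly in Hadoop's path globbing. The helper name `matches_pattern` and the two non-matching sample names are made up for illustration.

```shell
# Hypothetical helper: succeeds when a bare file name matches the pattern
# quoted in the comment above (three digits between "fileName-00." and ".gz").
matches_pattern() {
    case "$1" in
        fileName-00.[0-9][0-9][0-9].gz) return 0 ;;
        *) return 1 ;;
    esac
}

# The first two names come from the comment; the last two are invented
# counter-examples (too few digits, letters instead of digits).
for name in fileName-00.123.gz fileName-00.432.gz fileName-00.12.gz fileName-00.abc.gz; do
    if matches_pattern "$name"; then
        echo "match:    $name"
    else
        echo "no match: $name"
    fi
done
```

Each `[0-9]` stands for exactly one digit, so a name with two digits or with letters in that position is rejected.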

score 0 · Answer 1 · answered Dec 20 '22 at 18:19

Just went through this myself. It is very useful to pass NON-globbed wildcard expressions from the start script to Spark/PySpark, because the distribution mechanism inside the Spark program can be efficient when presented with something like this; note the globbing at both the directory level and the filename level:

df = spark.read.json('s3://my-bucket/archive/*/2014/7/G.*.json.bz2')

Not to mention of course that almost all the time you want globbing to occur on the remote resource, not your local launch environment.

The trick is to ensure that the initial shell variable does not get globbed when created and is also protected when passed to aws emr add-steps. Here is a simple launch script that assumes a cluster has already been created. To show it can be done, we also escape newlines to make it easier to see the args. Be careful, however, NOT to re-introduce extra whitespace when doing this!

# Use single quotes to stop globbing at the var level:                            
DATA_URI='s3://my-bucket/archive/*/2014/7/G.*.json.bz2'

#  DO NOT add trailing slash to the output_uri.  S3 will              
#  automatically create subdirs under that.  e.g. 
#  --output_uri s3://$SRC_BUCKET/V4_t 
#  will be created and populated with many part-0000-... files.
#  If you are not renaming or deleting the output_uri for each run,
#  make sure your spark program uses overwrite mode for dataframe output e.g.
#          dfx.write.mode("overwrite").json(output_uri)
                            

#  Careful to protect the DATA_URI arg by wrapping it in single quotes:
aws emr add-steps \
    --cluster-id j-3CDMYEF3NJGHR \
    --steps Type=Spark,\
Name="myAnalytics",\
ActionOnFailure=CONTINUE,\
Args=[\
s3://$SRC_BUCKET/blunders.py,\
--game_data,\'$DATA_URI\',\
--output_uri,s3://$SRC_BUCKET/V4_t]
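
The quoting behaviour the launch script relies on can be checked locally without touching S3 or EMR. The following is a minimal sketch, assuming bash or another POSIX-style shell (zsh treats unquoted variables differently by default); the scratch directory, the file name G.1.json.bz2 and the variable names are hypothetical. In these shells the expansion happens at the point where the variable is used unquoted, so that is the place that needs protecting.

```shell
# Hypothetical scratch directory holding one file that the pattern can match.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch G.1.json.bz2

# Single quotes keep the pattern literal inside the variable.
PATTERN='G.*.json.bz2'

# Unquoted use: the shell expands the pattern against local files.
unquoted=$(printf '%s' $PATTERN)

# Double-quoted use: the pattern reaches the command unchanged.
quoted=$(printf '%s' "$PATTERN")

# Escaped single quotes add literal quote characters around the value,
# which is the form the add-steps call above passes inside Args=[...].
wrapped=$(printf '%s' \'"$PATTERN"\')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
echo "wrapped:  $wrapped"
```

With the matching file present, the unquoted form prints the local file name while the other two keep the asterisk, which is what the remote Spark job needs to receive.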
