I have been running a streaming step in AWS/EMR with a mapper and reducer written in Python to map some of the archives in Common Crawl for sentiment analysis.
I am moving from the older Common Crawl textData format to the newer warc.gz format, and I need to know how to specify a range of warc.gz files as the input to my EMR step.
For example, in the older format I could specify an input range of textData files with a glob pattern like this:
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690165636/textData-000[0-9][0-9]
but the new format looks like this:
first file:
s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454702039825.90/warc/CC-MAIN-20160205195359-00000-ip-10-236-182-209.ec2.internal.warc.gz
second file:
s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454702039825.90/warc/CC-MAIN-20160205195359-00001-ip-10-236-182-209.ec2.internal.warc.gz
How do I specify a range of these warc.gz files as the input for the mapper?
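To make the goal concrete, here is a minimal local sketch of the matching behaviour in question, using Python's `fnmatch` on bare file names. The old pattern is the one quoted above. The new pattern, which wildcards the timestamp and host parts, is only a candidate: it is an unverified assumption that EMR/Hadoop input path globbing treats it the same way `fnmatch` does. The `-00100-` file names are hypothetical and are there only to show what should fall outside the range.

```python
from fnmatch import fnmatchcase

# Old-format glob, taken from the textData path above.
OLD_PATTERN = "textData-000[0-9][0-9]"

# Candidate new-format glob (assumption, not verified on EMR):
# wildcard the timestamp and the host, keep the 5-digit sequence number ranged.
NEW_PATTERN = "CC-MAIN-*-000[0-9][0-9]-ip-*.ec2.internal.warc.gz"


def select(names, pattern):
    """Return the names that match the glob pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


old_names = [
    "textData-00000",
    "textData-00099",
    "textData-00100",  # hypothetical name outside the 00000-00099 range
]

new_names = [
    "CC-MAIN-20160205195359-00000-ip-10-236-182-209.ec2.internal.warc.gz",
    "CC-MAIN-20160205195359-00001-ip-10-236-182-209.ec2.internal.warc.gz",
    # hypothetical name outside the 00000-00099 range
    "CC-MAIN-20160205195359-00100-ip-10-236-182-209.ec2.internal.warc.gz",
]

old_selected = select(old_names, OLD_PATTERN)
new_selected = select(new_names, NEW_PATTERN)

print(old_selected)
print(new_selected)
```

If Hadoop's globbing does behave like this, the full input would be the `.../warc/` prefix followed by the candidate pattern. That still needs to be confirmed on an actual EMR step.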