Mapping a range of warc.gz files, EMR

Question

I have been running a streaming step in AWS/EMR with a mapper and reducer written in Python to map some of the archives in Common Crawl for sentiment analysis.

I am moving from the older Common Crawl textData format to the newer warc.gz format, and I need to know how to specify a range of warc.gz files as my EMR input.

For example:

In the older format I could specify an input range of textData files as such:

s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690165636/textData-000[0-9][0-9]

but the new format looks like this:

first file:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454702039825.90/warc/CC-MAIN-20160205195359-00000-ip-10-236-182-209.ec2.internal.warc.gz

second file:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454702039825.90/warc/CC-MAIN-20160205195359-00001-ip-10-236-182-209.ec2.internal.warc.gz

How would I specify to map a range of these warc.gz files?

That's what I'm asking: how do I specify the file range? For textData files the numeric range is easy because the number sits at the end of the filename, but in the warc.gz files the number sits in the middle of each filename. Check out the 00000 and 00001 in the two warc.gz examples above. How do I specify that the step should run on both? - DataGuy, Jul 07 '16 at 17:46

score 0 · Answer 1 · answered Jul 07 '16 at 17:57

I'm pretty sure you can use the same method you were using previously, since the character range does not have to be at the end of the path. To read just the two files, you would use:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454702039825.90/warc/CC-MAIN-20160205195359-0000[0-1]-ip-10-236-182-209.ec2.internal.warc.gz

Also, since these paths carry more information than the previous ones, you have additional ways to specify sets of data to process:

CC-MAIN-2016-07 follows the pattern CC-MAIN-YYYY-ww (year and week), so you can select a set of years or weeks to process.

CC-MAIN-20160205195359 follows the pattern CC-MAIN-YYYYMMDDHHmmss (a timestamp), so you can select a date or time range.
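
Before launching a step, it can help to check locally which filenames a pattern would pick up. The snippet below is a minimal sketch that uses Python's fnmatch module as a stand-in for the glob matching done on the input path. That is an assumption: the two syntaxes are not identical, although a character class such as [0-1] is written the same way in both. The list of names is made up for illustration by continuing the numbering from the two filenames in the question.

```python
from fnmatch import fnmatchcase

# Filename parts taken from the two examples in the question.
PREFIX = "CC-MAIN-20160205195359-"
SUFFIX = "-ip-10-236-182-209.ec2.internal.warc.gz"

# Hypothetical listing: twelve names that continue the numbering scheme.
names = [f"{PREFIX}{i:05d}{SUFFIX}" for i in range(12)]

# The pattern from the answer: only files 00000 and 00001.
two_files = f"{PREFIX}0000[0-1]{SUFFIX}"
selected = [n for n in names if fnmatchcase(n, two_files)]

# A wider range: files 00000 through 00009.
ten_files = f"{PREFIX}0000[0-9]{SUFFIX}"
selected_ten = [n for n in names if fnmatchcase(n, ten_files)]

for name in selected:
    print(name)
```

If the timestamp or host part of the name differs from file to file within a segment, a `*` in those positions (for example `CC-MAIN-*-0000[0-1]-ip-*.ec2.internal.warc.gz`) would probably be needed; that is a guess about the data, not something confirmed in this thread.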

score 0 · Answer 2 · answered Aug 16 '16 at 18:03

You can download the lists of WARC, WAT and WET files for the July 2016 crawl from:

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-30/warc.paths.gz
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-30/wat.paths.gz
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-30/wet.paths.gz

To access a file via a browser, prepend this to the path listed in the file:

commoncrawl.s3.amazonaws.com/

In your case, to access the files via S3, try prepending this to the path instead:

s3://commoncrawl/
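
As a rough sketch of how that could be scripted: the helper below (a hypothetical name, `s3_inputs`) reads a gzipped paths listing, takes a slice of it, and prepends the s3://commoncrawl/ prefix to each entry. It assumes the listing holds one relative path per line. The sample entries are made up so that the example runs without a download; in practice the bytes would come from the warc.paths.gz file linked above.

```python
import gzip
import io
from typing import List

S3_PREFIX = "s3://commoncrawl/"

def s3_inputs(paths_gz: bytes, start: int, stop: int) -> List[str]:
    """Return S3 URIs for entries start..stop-1 of a gzipped paths listing."""
    with gzip.open(io.BytesIO(paths_gz), "rt", encoding="utf-8") as fh:
        # Assumption: one relative path per line; blank lines are skipped.
        paths = [line.strip() for line in fh if line.strip()]
    return [S3_PREFIX + p for p in paths[start:stop]]

# Made-up stand-in for the contents of a downloaded warc.paths.gz.
sample = gzip.compress(
    b"crawl-data/EXAMPLE/segments/0000/warc/example-00000.warc.gz\n"
    b"crawl-data/EXAMPLE/segments/0000/warc/example-00001.warc.gz\n"
    b"crawl-data/EXAMPLE/segments/0000/warc/example-00002.warc.gz\n"
)

# Select the first two files in the listing.
inputs = s3_inputs(sample, 0, 2)
for uri in inputs:
    print(uri)
```

Slicing the listing this way selects a range of files by position rather than by filename pattern, which sidesteps the question of where the number sits in the name.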
