How to get a listing of WARC files using HTTP for Common Crawl News Dataset?

Question

I can obtain a file listing for Common Crawl from:

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-09/wet.paths.gz

How can I do this with the Common Crawl News Dataset?

I tried different options, but I always get errors:

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS-2017-09/warc.paths.gz

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2017/09/warc.paths.gz

score 1 · Accepted Answer · answered Mar 21 '21 at 15:34

Because a new WARC file is added to the news dataset every few hours, a static file list does not make sense. Instead, you can list the files for any subset (by year or month) using the AWS CLI, e.g.:

aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2017/09/

See also the news data release announcement.
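
Since the question asks for HTTP access, here is a minimal Python sketch that turns the output of the `aws s3 ls --recursive` command above into HTTP URLs. It assumes each listing line has the form `<date> <time> <size> <key>` and that the files are served under the same `https://commoncrawl.s3.amazonaws.com/` base used in the question. The function name `listing_to_urls` and the `suffix` parameter are made up for illustration.

```python
from __future__ import annotations

# Base URL taken from the links in the question (an assumption; adjust it if
# the bucket is served from a different host).
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def listing_to_urls(listing: str, suffix: str = ".warc.gz") -> list[str]:
    """Convert `aws s3 ls --recursive` output into HTTP URLs.

    Each listing line is expected to look like:
        <date> <time> <size> <key>
    """
    urls = []
    for line in listing.splitlines():
        parts = line.split(None, 3)  # date, time, size, key
        if len(parts) != 4:
            continue  # skip blank or unexpected lines
        key = parts[3].strip()
        if key.endswith(suffix):  # keep only WARC files
            urls.append(BASE_URL + key)
    return urls
```

To use it, save the command output to a file (or read it from standard input) and pass the text to `listing_to_urls`. The resulting URLs can then be fetched with any HTTP client.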
