Download the complete Common Crawl index file

Question

The project below uses this Common Crawl index file:

mmap = BotoMap(s3_anon, src_bucket, '/common-crawl/projects/url-index/url-index.1356128792')

I want the complete index file (April 2015 crawl data) for my own project, which uses the above project as a base.

Where can I download the entire index file?

Here, Tom Morris states:

The index files which are used by the index service are also available for download.

m5khan · Accepted Answer · 2016-07-29T07:28:40.270

The Common Crawl index files are publicly available at s3://commoncrawl/cc-index/collections/

You can list all available crawl indexes with the AWS command line: aws s3 ls s3://commoncrawl/cc-index/collections/

The index files for April 2015 are at s3://commoncrawl/cc-index/collections/CC-MAIN-2015-18/indexes/

To download the index *.gz files over HTTP, use URLs of this form:

https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2015-18/indexes/cdx-00000.gz

The cdx files usually run from cdx-00000.gz to cdx-00299.gz, so the complete index is contained in 300 files.
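
If you would rather script the download than fetch each file by hand, here is a minimal Python sketch. It assumes the URL pattern shown above and the 300-file count (which holds for most crawls but may not for all); the function names cdx_urls and download_all and the cc-index output directory are just illustrative choices.

```python
import urllib.request
from pathlib import Path

# Base of the HTTP URL pattern shown in the answer above.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/cc-index/collections"


def cdx_urls(crawl_id="CC-MAIN-2015-18", num_shards=300):
    """Build the URLs for cdx-00000.gz .. cdx-<num_shards - 1>.gz.

    num_shards=300 is an assumption taken from the answer; adjust it
    if the crawl you need has a different number of cdx files.
    """
    return [
        f"{BASE_URL}/{crawl_id}/indexes/cdx-{i:05d}.gz"
        for i in range(num_shards)
    ]


def download_all(urls, dest_dir="cc-index"):
    """Download every URL into dest_dir, skipping files already present."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in urls:
        target = dest / url.rsplit("/", 1)[-1]
        if target.exists():
            continue  # makes it safe to re-run after an interruption
        urllib.request.urlretrieve(url, str(target))


# Example call (downloads the full April 2015 index, which is large):
# download_all(cdx_urls())
```

Building the URL list separately from the download step lets you check the list, or fetch only a few files, before pulling the whole index.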
