
The Common Crawl index file used in the project below

https://github.com/trivio/common_crawl_index/blob/master/bin/remote_copy

mmap = BotoMap(s3_anon, src_bucket, '/common-crawl/projects/url-index/url-index.1356128792')

is a partial one.

I want the complete index file (April 2015 crawl data) for my own project, which uses the above project as a base.

Where can I download the entire index file?

Here Tom Morris states that

The index files which are used by the index service are also available for download.

Vanaja Jayaraman

1 Answer


Common crawl index files are publicly available at s3://commoncrawl/cc-index/collections/

You can list all available crawl indexes with the AWS command line: aws s3 ls s3://commoncrawl/cc-index/collections/

Index files for April 2015 are at s3://commoncrawl/cc-index/collections/CC-MAIN-2015-18/indexes/

If you want to download the index *.gz files over HTTP instead, you can use a URL of this form:

https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2015-18/indexes/cdx-00000.gz

The cdx files are mostly numbered from cdx-00000.gz up to cdx-00299.gz, so the complete index is contained in 300 files.
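To fetch every part rather than a single file, the URLs can be generated from the numbering pattern above. The following is only a minimal sketch: the helper names `cdx_urls` and `download_all`, the `cc-index` output directory, and the default count of 300 are assumptions for illustration, and the count should be checked against the actual listing since other crawls may differ.

```python
import shutil
import urllib.request
from pathlib import Path

BASE_URL = "https://commoncrawl.s3.amazonaws.com/cc-index/collections"


def cdx_urls(crawl_id="CC-MAIN-2015-18", count=300):
    """Build the HTTP URLs for cdx-00000.gz up to cdx-<count-1>.gz.

    count=300 is an assumption taken from the numbering described above.
    """
    return [
        f"{BASE_URL}/{crawl_id}/indexes/cdx-{i:05d}.gz"
        for i in range(count)
    ]


def download_all(urls, dest_dir="cc-index"):
    """Download each URL into dest_dir, skipping files already present."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in urls:
        target = dest / url.rsplit("/", 1)[-1]
        if target.exists():
            continue  # lets an interrupted run resume without refetching
        with urllib.request.urlopen(url) as resp, open(target, "wb") as out:
            shutil.copyfileobj(resp, out)


# Example (large download, so left commented out):
# download_all(cdx_urls())
```

Skipping files that already exist lets an interrupted run be restarted without downloading the earlier parts again.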

m5khan