In Common Crawl, the same URL can be harvested multiple times.

For instance, a Reddit blog post can be crawled when it is created and again when subsequent comments are added.

Is there a way to find out when a given URL was crawled for the first time by Common Crawl?

dzieciou

1 Answer

The URL indexes (CDX or columnar) include a field/column with the capture time. Search for the URL, record all captures, sort them by capture time to find the earliest one, and then inspect the page content of each capture to see when comments were added. The indexes also include the WARC file name, record offset and record length, which allow you to fetch the WARC record with an HTTP range request.
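As a rough illustration, here is a minimal Python sketch of that lookup against the CDX index. It is not an official client: the host names `index.commoncrawl.org` and `data.commoncrawl.org`, the `collinfo.json` listing with its `cdx-api` key, and the JSON field names (`timestamp`, `filename`, `offset`, `length`) are assumptions about the public CDX API, and all function names are made up for this example.

```python
import json
from datetime import datetime, timezone
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed public endpoints; adjust if the hosting changes.
INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"


def parse_cdx_lines(text):
    """Parse a CDX response with output=json (one JSON object per line)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def earliest_capture(captures):
    """Return the capture with the smallest timestamp, or None if empty.

    Timestamps are fixed-width YYYYMMDDhhmmss strings, so string order
    matches chronological order.
    """
    return min(captures, key=lambda c: c["timestamp"], default=None)


def capture_time(capture):
    """Convert the 14-digit capture timestamp to an aware UTC datetime."""
    parsed = datetime.strptime(capture["timestamp"], "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def range_header(capture):
    """Build the Range header value covering exactly one WARC record."""
    start = int(capture["offset"])
    end = start + int(capture["length"]) - 1  # Range end is inclusive
    return f"bytes={start}-{end}"


def query_crawl(cdx_api, url):
    """Return all captures of `url` in one crawl's CDX index."""
    query = urlencode({"url": url, "output": "json"})
    try:
        with urlopen(f"{cdx_api}?{query}") as resp:
            return parse_cdx_lines(resp.read().decode("utf-8"))
    except HTTPError as err:
        if err.code == 404:  # assumed: the server answers 404 for no captures
            return []
        raise


def find_first_capture(url):
    """Query every listed crawl and return the earliest capture (or None)."""
    with urlopen(f"{INDEX_HOST}/collinfo.json") as resp:
        crawls = json.loads(resp.read().decode("utf-8"))
    captures = []
    for crawl in crawls:
        captures.extend(query_crawl(crawl["cdx-api"], url))
    return earliest_capture(captures)


def fetch_warc_record(capture):
    """Fetch the gzipped WARC record for a capture via a range request."""
    request = Request(
        f"{DATA_HOST}/{capture['filename']}",
        headers={"Range": range_header(capture)},
    )
    with urlopen(request) as resp:
        return resp.read()
```

Calling `find_first_capture("example.com/some/page")` and then `capture_time(...)` on the result would give the first crawl date, provided the endpoints behave as assumed. The sketch queries one crawl index after another, so it makes one request per crawl for each URL.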

Sebastian Nagel
  • Sounds good. With the hosted CDX indexes, 200 URLs alone will take around 4 hours, and that only goes back to the 2016 crawls (the first crawls started in 2008). I guess a paid option with the columnar indexes would be much faster. – dzieciou Mar 08 '21 at 08:00