I have crawled data from Common Crawl and I want to find the URL corresponding to each record.

for record in files:
    print record['WARC-Target-URI']

This outputs an empty list. I am following the post at https://dmorgan.info/posts/common-crawl-python/. Do we get a target URI for each record, or just one target URI per WARC file path?

Ravi Ranjan
  • It's hard to understand what the reason could be without detailed logs. – Sebastian Nagel Jul 18 '17 at 07:45
  • Did you update the examples from [dmorgan.info](https://dmorgan.info/posts/common-crawl-python/) so that URLs and paths point to the correct data location? The data was moved last year to the bucket s3://commoncrawl/ (cf. [CC group](https://groups.google.com/d/topic/common-crawl/nKuQK68rebo/discussion)): 1. remove the path prefix `common-crawl/`; 2. change the host in URLs to `commoncrawl.s3.amazonaws.com`. So `https://aws-publicdatasets.s3.amazonaws.com/common-crawl/` becomes `https://commoncrawl.s3.amazonaws.com/`. – Sebastian Nagel Jul 18 '17 at 07:54
  • Yes, I have updated the paths accordingly, and I can see the value of record.payload.read(), but record['WARC-Target-URI'] returns nothing. The same is true for record['Content-Language']. – Ravi Ranjan Jul 18 '17 at 08:38
  • The `record['WARC-Target-URI']` should be there except for the first "warcinfo" record. `Content-Language` is not part of the WARC record header. It's part of the HTTP header, which is in `record.payload`. – Sebastian Nagel Jul 18 '17 at 08:53

1 Answer

The info you're after is part of the header. Try:

print record.header['WARC-Target-URI']
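
To illustrate where each field lives, here is a minimal standard-library sketch. It is not the `warc` package's API: `iter_warc_records`, `http_header`, `make_record`, and the two-record sample are hypothetical names and data made up for this example. It assumes an uncompressed in-memory stream (real Common Crawl files are gzip-compressed) and shows that every record carries its own WARC header, that the leading "warcinfo" record has no `WARC-Target-URI`, and that `Content-Language` has to be read from the HTTP header inside the payload.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the WARC header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, payload


def http_header(payload, name):
    """Return one HTTP header value from a response payload, or None."""
    head = payload.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    for line in head.split("\r\n")[1:]:  # skip the HTTP status line
        key, _, value = line.partition(":")
        if key.strip().lower() == name.lower():
            return value.strip()
    return None


def make_record(warc_headers, body):
    """Build one synthetic WARC record (hypothetical helper for the demo)."""
    lines = ["WARC/1.0"] + ["%s: %s" % kv for kv in warc_headers]
    lines.append("Content-Length: %d" % len(body))
    return "\r\n".join(lines).encode("utf-8") + b"\r\n\r\n" + body + b"\r\n\r\n"


# Made-up sample: a warcinfo record followed by one response record.
sample = (
    make_record([("WARC-Type", "warcinfo")], b"software: example\r\n")
    + make_record(
        [("WARC-Type", "response"), ("WARC-Target-URI", "http://example.com/")],
        b"HTTP/1.1 200 OK\r\nContent-Language: en\r\n\r\n<html></html>",
    )
)

results = [
    (h.get("WARC-Type"), h.get("WARC-Target-URI"), http_header(p, "Content-Language"))
    for h, p in iter_warc_records(io.BytesIO(sample))
]
for row in results:
    print(row)
```

With the `warc` package, the equivalent lookups would go through `record.header[...]` for WARC fields and through the HTTP header in `record.payload` for fields such as `Content-Language`.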

Mark Lapierre