Cannot find URL in a WARC file crawled from Common Crawl

Question

I have data crawled from Common Crawl and I want to find the URL corresponding to each record.

for record in files:
     print record['WARC-Target-URI']

This outputs an empty list. I am following the post at https://dmorgan.info/posts/common-crawl-python/. Is there a target URI for each record, or just one target URI per WARC file path?

It's hard to understand what the reason could be without detailed logs. — Sebastian Nagel, Jul 18 '17 at 07:45
Did you update the examples from [dmorgan.info](https://dmorgan.info/posts/common-crawl-python/) so that the URLs and paths point to the correct data location? The data was moved last year to the bucket s3://commoncrawl/ (cf. [CC group](https://groups.google.com/d/topic/common-crawl/nKuQK68rebo/discussion)): (1) remove the path prefix `common-crawl/`; (2) change the host in URLs to `commoncrawl.s3.amazonaws.com`. So `https://aws-publicdatasets.s3.amazonaws.com/common-crawl/` becomes `https://commoncrawl.s3.amazonaws.com/`. — Sebastian Nagel, Jul 18 '17 at 07:54
Yes, I have updated the paths accordingly and I can see the value of record.payload.read(), but record['WARC-Target-URI'] returns nothing. The same is true of record['Content-Language']. — Ravi Ranjan, Jul 18 '17 at 08:38
`record['WARC-Target-URI']` should be there for every record except the first "warcinfo" record. `Content-Language` is not part of the WARC record header; it is part of the HTTP header, which is in `record.payload`. — Sebastian Nagel, Jul 18 '17 at 08:53
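
The path change described in the comments above could be sketched roughly as follows. The helper name `update_url` and the sample file path are hypothetical and only for illustration; the two prefixes are the ones quoted in the comment.

```python
# Prefixes quoted in the comment above (old bucket vs. new bucket).
OLD_PREFIX = "https://aws-publicdatasets.s3.amazonaws.com/common-crawl/"
NEW_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def update_url(url):
    """Rewrite an old-style Common Crawl URL to the new bucket host.

    Hypothetical helper: swaps the host and drops the `common-crawl/`
    path prefix in one step. URLs that do not use the old prefix are
    returned unchanged.
    """
    if url.startswith(OLD_PREFIX):
        return NEW_PREFIX + url[len(OLD_PREFIX):]
    return url


# The file path below is made up for the example.
old_url = OLD_PREFIX + "crawl-data/example/file.warc.gz"
new_url = update_url(old_url)
print(new_url)
```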

score 1 · Answer 1 · answered Jul 18 '17 at 12:37

The info you're after is part of the header. Try:

print record.header['WARC-Target-URI']
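
To make it clearer where each field lives, here is a minimal standard-library sketch (Python 3) that reads WARC record headers directly instead of going through the `warc` package. It is not how that library is implemented; the function names `iter_warc_records`, `http_headers` and `make_record`, and the two in-memory sample records, are made up for illustration. It assumes simple `Name: value` header lines without continuation lines.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (warc_headers, payload) for each record in a WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, payload


def http_headers(payload):
    """Parse the HTTP header block at the start of a response payload."""
    head, _, _ = payload.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")[1:]  # skip the status line
    pairs = (l.partition(":") for l in lines)
    return {name.strip(): value.strip() for name, _, value in pairs}


def make_record(fields, body):
    """Build one sample WARC record (illustration only)."""
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % f for f in fields)
    head += "Content-Length: %d\r\n\r\n" % len(body)
    return head.encode("utf-8") + body + b"\r\n\r\n"


# Two made-up records: a warcinfo record (no target URI) and a response.
response_body = b"HTTP/1.1 200 OK\r\nContent-Language: en\r\n\r\n<html></html>"
sample = gzip.compress(
    make_record([("WARC-Type", "warcinfo")], b"software: example\r\n")
) + gzip.compress(
    make_record(
        [("WARC-Type", "response"), ("WARC-Target-URI", "http://example.com/")],
        response_body,
    )
)

records = list(iter_warc_records(gzip.open(io.BytesIO(sample))))
uris = [headers.get("WARC-Target-URI") for headers, _ in records]
print(uris)
print(http_headers(records[1][1]).get("Content-Language"))
```

For a real file, the same loop would run over `gzip.open(path, "rb")`. The first record is the `warcinfo` record mentioned in the comments, so a missing `WARC-Target-URI` there is expected, and `Content-Language` only appears once the HTTP header inside the payload is parsed.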

Mark Lapierre
