
I want to find a list of all the websites that contain a specific keyword. For example, if I search for the keyword "Sports" or "Football", only the related website URLs, titles, descriptions, and images should be extracted from Common Crawl WARC files. At present I am able to read the WARC file with the following code:

import warc

f = warc.open("firsttest.warc.gz")
name = "sports"
for record in f:
    url = record.header.get('warc-target-uri', 'none')
    date = record.header.get('WARC-Date')
    IP = record.header.get('WARC-IP-Address')
    payload_di = record.header.get('WARC-Payload-Digest')
    # NOTE: this checks the header field *names*, not the URL or page content
    search = name in record.header
    print("URL :" + str(url))
    #print("date :" + str(date))
    #print("IP :" + str(IP))
    #print("payload_digest :" + str(payload_di))
    #print("search :" + str(search))
    text = record.payload.read()
    #print("Text :" + str(text))
    #break

    #print(url)

But it is printing all the URLs in the specified WARC file. I need only the related URLs that match "sports" or "football". How can I search for those keywords in WARC files? Please help me with this, as I am new to Common Crawl. I have also checked a lot of posts, but none of them worked out.
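One way to narrow the loop down is to test the target URI against the keywords before processing the record. This is a minimal sketch under my own assumptions: the helper name `matches_keywords` and the keyword list are mine, and the commented-out integration reuses the loop variables from the code above.

```python
def matches_keywords(url, keywords):
    """Return True if any keyword appears in the URL (case-insensitive)."""
    url = (url or "").lower()
    return any(kw.lower() in url for kw in keywords)

# Hypothetical integration with the loop above:
# for record in f:
#     url = record.header.get('warc-target-uri', 'none')
#     if not matches_keywords(url, ["sports", "football"]):
#         continue  # skip records whose URL mentions neither keyword
#     text = record.payload.read()
#     ...process the matching record...
```

Matching on the URL is cheap but misses pages whose address does not mention the topic; to match on page text instead, you would apply the same test to `record.payload.read()` after decoding it.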

I also need to grab the article image if the page has one. How can I extract it, given that Common Crawl saves the entire web page?
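Since the WARC response payload contains the page's full HTML (after the HTTP headers), one option is to parse it for an image reference. This is a sketch using only the standard library's `html.parser`; the preference for the `og:image` meta tag over the first `<img>` tag is my own heuristic, not anything Common Crawl prescribes.

```python
from html.parser import HTMLParser


class ImageFinder(HTMLParser):
    """Collect the og:image meta tag (preferred) and the first <img> src."""

    def __init__(self):
        super().__init__()
        self.og_image = None
        self.first_img = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:image":
            self.og_image = attrs.get("content")
        elif tag == "img" and self.first_img is None:
            self.first_img = attrs.get("src")


def extract_article_image(html_text):
    """Return a likely article image URL from raw HTML, or None."""
    finder = ImageFinder()
    finder.feed(html_text)
    return finder.og_image or finder.first_img
```

You would feed this the decoded HTML portion of `record.payload.read()` (the bytes after the blank line that ends the HTTP headers); the returned value may be a relative URL that needs resolving against the record's target URI.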

Dinesh Manne

1 Answer


You can use AWS Athena to query the Common Crawl index on S3. For example, here is a SQL query to find URLs matching "sports" or "football" in the March 2019 index (crawl label CC-MAIN-2019-13). See this page - http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

SELECT *
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2019-13'
  AND subset = 'warc'
  AND (url_path LIKE '%sports%' OR url_path LIKE '%football%')
LIMIT 10
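Each index row also tells you where the record lives, so you can fetch just that record from S3 with an HTTP range request instead of downloading the whole WARC file. A minimal sketch, assuming the columnar index's `warc_filename`, `warc_record_offset`, and `warc_record_length` columns and the public `commoncrawl` S3 bucket; the helper name and the `requests` usage are mine.

```python
def byte_range_header(offset, length):
    """Build the HTTP Range header for one WARC record.

    HTTP byte ranges are inclusive on both ends, so a record of
    `length` bytes starting at `offset` ends at offset + length - 1.
    """
    return {"Range": "bytes={}-{}".format(offset, offset + length - 1)}

# Hypothetical usage with an Athena result row (network call, not run here):
# import requests
# resp = requests.get(
#     "https://commoncrawl.s3.amazonaws.com/" + row["warc_filename"],
#     headers=byte_range_header(row["warc_record_offset"],
#                               row["warc_record_length"]),
# )
# # resp.content is a single gzip member holding that one WARC record
```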


Vikash Rathee