I want to find a list of all the websites that contain a specific keyword. For example, if I search for "sports" or "football", only the matching website URLs, titles, descriptions, and images should be extracted from Common Crawl WARC files. At present I am able to read a WARC file fine with the following code:
import warc

f = warc.open("firsttest.warc.gz")
h = warc.WARCHeader({"WARC-Type": "response"}, defaults=True)
N = 10
name = "sports"

for record in f:
    url = record.header.get("warc-target-uri", "none")
    date = record.header.get("WARC-Date")
    ip = record.header.get("WARC-IP-Address")
    payload_digest = record.header.get("WARC-Payload-Digest")
    # note: this only checks the header field *names*, not the page content
    search = name in record.header
    print("URL: " + str(url))
    # print("date: " + str(date))
    # print("IP: " + str(ip))
    # print("payload_digest: " + str(payload_digest))
    # print("search: " + str(search))
    text = record.payload.read()
    # print("Text: " + str(text))
But this prints every URL in the WARC file. I need only the URLs whose pages actually match "sports" or "football". How can I search for those keywords inside the WARC records? Please help me with this, as I am new to Common Crawl; I also checked a lot of posts but none of them worked.
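Here is a sketch of the kind of filter I have in mind, assuming the same `warc` library as above. `payload_matches` is a helper name I made up for illustration; note that a WARC "response" payload starts with the HTTP response headers, which you may want to split off before searching:

```python
def payload_matches(payload, keywords):
    """Return True if any keyword appears in the payload (case-insensitive)."""
    text = payload.decode("utf-8", errors="ignore").lower()
    return any(k.lower() in text for k in keywords)

# Usage inside the existing loop would look roughly like:
#
#   for record in f:
#       payload = record.payload.read()
#       # drop the HTTP headers, keep the HTML body
#       body = payload.split(b"\r\n\r\n", 1)[-1]
#       if payload_matches(body, ["sports", "football"]):
#           print(record.header.get("warc-target-uri"))

# Quick check on a fake payload:
sample = b"<html><title>Local Football Club</title></html>"
print(payload_matches(sample, ["sports", "football"]))  # True
```

Would checking the decoded body like this be the right approach, or is there a faster way than reading every payload?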
I also need to grab the article image if the page has one. Since Common Crawl saves the entire web page, how can I extract it?
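For the title, description, and image, I imagine something like the sketch below could work on the HTML body of each matching record, using only the standard library's `html.parser`. The `name="description"` and `property="og:image"` meta tags are common conventions, but not every page will have them:

```python
from html.parser import HTMLParser

class PageInfoParser(HTMLParser):
    """Collect <title>, the meta description, and the og:image URL."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self.image = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            if attrs.get("name") == "description":
                self.description = attrs.get("content")
            elif attrs.get("property") == "og:image":
                self.image = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and self.title is None:
            self.title = data.strip()

html = """<html><head>
<title>Football News</title>
<meta name="description" content="Latest football results">
<meta property="og:image" content="http://example.com/ball.jpg">
</head><body>...</body></html>"""

p = PageInfoParser()
p.feed(html)
print(p.title, p.description, p.image)
```

Is parsing the saved HTML like this the expected way to get the article image out of Common Crawl, or is there a ready-made index for it?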