I am writing the following code to extract values from a WARC file. My goal is to find sites whose robots.txt contains:
User-Agent: *
Disallow: /
I would like it to print only the URLs whose robots.txt contains the rules above.
My Python code currently prints only the URL lines:
file = 'robots.warc'
num_lines = sum(1 for line in open(file, errors='ignore'))
print('file has', num_lines, 'lines')
with open(file, errors='ignore') as lines:
    for line in lines:
        if line.startswith("WARC-Target-URI:"):
            print(line)
Here is an example WARC file.
Thanks for your help!