I am writing the following code to extract values from a WARC file. My goal is to find sites whose robots.txt contains:

User-Agent: * 
Disallow: /

I would like it to print only the URLs whose robots.txt contains the rules above.

My current Python code only prints the WARC-Target-URI line of each record:

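file = 'robots.warc'

# Count the lines; the with-block closes the file when done
with open(file, errors='ignore') as lines:
    num_lines = sum(1 for line in lines)
print('file has', num_lines, 'lines')

# Print the target URL header of every WARC record
with open(file, errors='ignore') as lines:
    for line in lines:
        if line.startswith("WARC-Target-URI:"):
            print(line.strip())  # strip the newline so print() does not double it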
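
To make the goal concrete, here is a rough sketch of the kind of filter I am after. It is not tested against the real file and rests on several assumptions: the WARC file is uncompressed and readable as text, every record starts with a line beginning with `WARC/`, and a robots.txt "blocks everything" when a group that names `User-agent: *` contains exactly `Disallow: /`. The function names `disallows_all` and `urls_disallowing_all` are made up for illustration.

```python
def disallows_all(record_lines):
    """Return True if the lines contain a 'User-agent: *' group with 'Disallow: /'."""
    agents = []       # user agents named by the group currently being read
    in_rules = False  # True once the current group has seen an Allow/Disallow line
    for raw in record_lines:
        line = raw.split('#', 1)[0].strip()  # drop robots.txt comments
        if ':' not in line:
            continue
        field, value = (part.strip() for part in line.split(':', 1))
        field = field.lower()  # field names are matched case-insensitively
        if field == 'user-agent':
            if in_rules:
                # A User-agent line after rule lines starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ('allow', 'disallow'):
            in_rules = True
            if field == 'disallow' and value == '/' and '*' in agents:
                return True
    return False


def urls_disallowing_all(lines):
    """Yield the WARC-Target-URI of each record whose content disallows everything."""
    uri, record = None, []
    for line in lines:
        if line.startswith('WARC/'):
            # Assumed record boundary: check the record collected so far
            if uri and disallows_all(record):
                yield uri
            uri, record = None, []
        elif line.startswith('WARC-Target-URI:'):
            uri = line.split(':', 1)[1].strip()
        record.append(line)
    if uri and disallows_all(record):  # do not forget the last record
        yield uri


# Possible usage with the file from the question:
# with open('robots.warc', errors='ignore') as lines:
#     for url in urls_disallowing_all(lines):
#         print(url)
```

The idea is to remember the most recent `WARC-Target-URI` and only report it once the rest of that record has been checked, instead of printing every URI as it is seen. A dedicated WARC parsing library would probably be more robust than matching on line prefixes, especially for compressed files.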

Here is an example warc file

Thanks for your help!

Trey Copeland