Reading the content of robots.txt in Python and printing it

Question

I want to check whether a given website has a robots.txt file, read the entire content of that file, and print it. Storing the content in a dictionary would also be very helpful.

I've tried playing with the robotparser module but can't figure out how to do it.

I would like to use only modules that come with the standard Python 2.7 package.

I did as @Stefano Sanfilippo suggested:

from urllib.request import urlopen

returned

Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    from urllib.request import urlopen
ImportError: No module named request

So I tried:

import urllib2
from urllib2 import Request
from urllib2 import urlopen
with urlopen("https://www.google.com/robots.txt") as stream:
    print(stream.read().decode("utf-8"))

but got:

Traceback (most recent call last):
  File "", line 1, in <module>
    with urlopen("https://www.google.com/robots.txt") as stream:
AttributeError: addinfourl instance has no attribute '__exit__'

From bugs.python.org it seems that this is not supported in Python 2.7. As a matter of fact, the code works fine with Python 3. Any idea how to work around this?

You don't need to know anything about the structure of the site to know where `robots.txt` has to be. It's always at `whatever.site.name/robots.txt`. – user2357112 Jul 19 '14 at 10:05
@jonsharpe I reworded the question. Is it sufficiently narrow now? The question is solved, but I was wondering if the "on hold" status could be removed. Thanks – The One Electronic Jul 23 '14 at 13:35

Stefano Sanfilippo · Accepted Answer · 2014-07-20T10:27:33.773

3

Yes, robots.txt is just a file: download it and print it!

Python 3:

from urllib.request import urlopen

with urlopen("https://www.google.com/robots.txt") as stream:
    print(stream.read().decode("utf-8"))

Python 2:

from urllib import urlopen
from contextlib import closing

with closing(urlopen("https://www.google.com/robots.txt")) as stream:
    print stream.read()

Note that the path is always /robots.txt.
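
As a rough sketch of that rule, the robots.txt URL for any page on a site can be derived with `urljoin`. The helper name `robots_url` is hypothetical, and the `try`/`except` import is only an assumption made so that the snippet runs on both Python 2.7 and Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7


def robots_url(page_url):
    """Return the robots.txt URL for the site hosting page_url (hypothetical helper)."""
    # The leading slash makes urljoin replace the whole path and drop the query string.
    return urljoin(page_url, "/robots.txt")


print(robots_url("https://www.google.com/search?q=python"))
```

The result can then be passed to `urlopen` exactly as in the snippets above.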

If you need to put the content in a dictionary, .split(":") and .strip() are your friends.
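
One possible way to apply `.split(":")` and `.strip()` is sketched below. The helper name `parse_robots`, the choice to group values in lists (a field such as `Disallow` usually appears many times) and the sample text are all assumptions for illustration; it takes the already-downloaded text and runs on Python 2.7 and Python 3.

```python
from collections import defaultdict


def parse_robots(text):
    """Group robots.txt directives by lowercased field name (hypothetical helper)."""
    rules = defaultdict(list)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)  # split only once so URLs keep their colons
        rules[key.strip().lower()].append(value.strip())
    return dict(rules)


# Made-up sample standing in for the text returned by stream.read()
sample = """\
User-agent: *
Disallow: /search
Allow: /search/about
Sitemap: https://example.com/sitemap.xml
"""

print(parse_robots(sample))
```

This flat mapping loses the grouping of rules under each `User-agent` block, so it is only suitable for a quick overview of the file.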

edited Jul 20 '14 at 10:27

answered Jul 19 '14 at 10:07

Stefano Sanfilippo


Your code works with Python 3 but not with Python 2.7. Can you suggest how to make it work with Python 2.7? – The One Electronic Jul 20 '14 at 09:18
See the edit. However, you should _really_ be using Python 3, unless you have a specific reason to stick with Python 2. Python 2 is legacy, and that's not just my opinion: [it's official](https://wiki.python.org/moin/Python2orPython3). – Stefano Sanfilippo Jul 20 '14 at 10:28
Thanks @Stefano Sanfilippo, I'll check out the 2to3 tool to convert my code. I don't know why I had the impression that using version 2.7 was still a good idea. – The One Electronic Jul 20 '14 at 18:57
