
I have a script, meant for personal use, that scrapes some websites for information. Until recently it worked just fine, but it seems one of the websites has buffed up its security and I can no longer get access to its contents.

I'm using python with requests and BeautifulSoup to scrape the data, but when I try to grab the content of the website with requests, I run into the following:

'<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head><iframe src="/_Incapsula_Resource?CWUDNSAI=9_4E402615&incident_id=133000790078576866-343390778581910775&edet=12&cinfo=4bb304cac75381e904000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 133000790078576866-343390778581910775</iframe></html>'
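For reference, the fetch itself is nothing fancy; roughly this, with the real URL replaced by a placeholder:

import requests
from bs4 import BeautifulSoup

# Placeholder URL; the real site is omitted here
r = requests.get("https://example.com/some-page")
soup = BeautifulSoup(r.text, "html.parser")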

I've done a bit of research, and it looks like this is what's stopping me: http://www.robotstxt.org/meta.html

Is there any way I can convince the website that I'm not a malicious robot? This is a script I run roughly once a day against a single page, so I'm not really a burden on their servers by any means. Just someone with a script to make things easier :)

EDIT: Tried switching to mechanize and ignoring robots.txt that way, but I'm now getting a 403 Forbidden response. I suppose they have changed their stance on scraping and have not updated their TOS yet. Time to go to Plan B and stop using the website, unless anyone has other ideas.
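In case it helps, what I tried with mechanize looked roughly like this (the URL is a placeholder):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # skip fetching/honouring robots.txt
br.addheaders = [("User-agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1")]
response = br.open("https://example.com/some-page")   # placeholder URL
html = response.read()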

Austin
  • You can try to insert a normal browser's user-agent in your request header. – andypp Nov 10 '13 at 15:24
  • Adding a valid referrer also may help (see the sketch after these comments). – Pedro Werneck Nov 10 '13 at 15:30
  • Maybe it's time to review the website's Terms of Service to see if there have been changes there - are you sure your site scraping is something the website owner wants to allow? There's more to TOS than whether you are putting a burden on the server. – PaulMcG Nov 10 '13 at 16:13
  • Yes, I did check the TOS. They disallow any access faster than a human could produce in a web browser: `You agree not to use or launch any automated system, including without limitation, "robots," "spiders," "offline readers," etc., that accesses the Service in a manner that sends more request messages to the Company servers than a human can reasonably produce in the same period of time by using a conventional on-line web browser` I send one request per day, so I think I fall into the legally acceptable area. – Austin Nov 10 '13 at 16:18
  • Do you get that same message if you try to hit the site with a browser? Also, what does the robots.txt file say? In any case, robots.txt and the robots meta tag can't prevent your bot from downloading. – Jim Mischel Nov 11 '13 at 02:06
  • I can use the website like normal with any browser. I just checked the robots.txt and it reads: – Austin Nov 12 '13 at 19:42
  • User-agent: *
    Disallow: /search.php
    Disallow: /searchc.php
    Disallow: /status.php
    Disallow: /contact.php
    User-agent: Baiduspider
    Disallow: /search.php
    Disallow: /searchc.php
    Disallow: /status.php
    Disallow: /contact.php
    User-agent: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
    Disallow: /search.php
    Disallow: /searchc.php
    Disallow: /status.php
    Disallow: /contact.php
    User-agent: Baiduspider+(+http://www.baidu.com/search/spider.htm)
    Disallow: /search.php
    Disallow: /searchc.php
    Disallow: /status.php
    Disallow: /contact.php
    – Austin Nov 12 '13 at 19:42
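A minimal sketch of the headers suggested in the comments above, a browser-like User-Agent plus a Referer (the URL and referrer below are placeholders):

import requests

# Placeholder URL and referrer; both depend on the actual site being scraped
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
    "Referer": "https://example.com/",
}
r = requests.get("https://example.com/some-page", headers=headers)
print(r.status_code)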

1 Answer


What is most likely happening is that the server is checking the User-Agent header and denying access to the default user agent used by bots.

For example, requests sets the user agent to something like `python-requests/2.9.1`.
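If you want to see exactly what your installed version sends, requests exposes the default string (assuming a reasonably recent requests release):

import requests

# Prints something like 'python-requests/2.9.1'
print(requests.utils.default_user_agent())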

You can specify the headers yourself:

url = "https://google.com"
UAS = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1", 
       "Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
       "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
       )

ua = UAS[random.randrange(len(UAS))]

headers = {'user-agent': ua}
r = requests.get(url, headers=headers)
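Since the question already uses BeautifulSoup, the response can then be parsed as before; a minimal sketch assuming the built-in html.parser backend:

from bs4 import BeautifulSoup

# Parse the HTML returned with the spoofed user agent
soup = BeautifulSoup(r.text, "html.parser")
print(soup.title)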
Wally
  • Wow, there's a flashback. I used to do a little scraping with Perl back in the late-90's, and I learned about the user agent. I'm just getting back into it in 2022 with python and soup. And user agent is still a thing. Who knew. – J B Jul 15 '22 at 19:15