21

Consider the following Python code:

    import urllib.request

    url = "http://www.google.com/search?hl=en&safe=off&q=Monkey"
    url_object = urllib.request.urlopen(url)
    print(url_object.read())

When this is run, the following exception is raised:

File "/usr/local/lib/python3.0/urllib/request.py", line 485, in http_error_default
   raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

However, when the same URL is entered into a browser, the search returns results as expected. What's going on here? How can I overcome this so I can search Google programmatically?

Any thoughts?

Kara
AgentLiquid

4 Answers

28

This should do the trick:

    import urllib2

    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    url = "http://www.google.com/search?hl=en&safe=off&q=Monkey"
    headers = {'User-Agent': user_agent}

    request = urllib2.Request(url, None, headers)  # the assembled request
    response = urllib2.urlopen(request)
    data = response.read()  # the data you need
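Since the question runs on Python 3, where urllib2 no longer exists, the same trick maps onto urllib.request. Here is a sketch of the Python 3 equivalent, wrapped in a function so nothing fires on import; the User-Agent value is only an example of a browser-like string:

```python
import urllib.parse
import urllib.request

def google_search(query):
    # A browser-like User-Agent; the exact string is just an example.
    user_agent = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; '
                  'rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7')
    # Quote the query so spaces and punctuation survive the URL.
    url = ('http://www.google.com/search?hl=en&safe=off&q='
           + urllib.parse.quote(query))
    request = urllib.request.Request(url, None, {'User-Agent': user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()
```

urllib.request.Request takes the headers dict as its third positional argument, just as urllib2.Request did, so the port is mechanical.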
  • Could you please format your code? (Just select it and press Ctrl-K.) – Stephan202 May 12 '09 at 20:52
  • +1 This works perfectly. However, if you request from the Google API too often, it will block your requests (i.e. throw errors). I've rate limited mine to once every 3 seconds and I don't seem to be getting blocked anymore. – Spike Aug 27 '12 at 04:37
26

If you want to do Google searches "properly" through a programming interface, take a look at the Google APIs. Not only are these the official way of searching Google, they are also unlikely to break if Google changes its result page layout.

Kevin Lacquement
  • Do you have any idea what's going on under the hood, though? I'm curious ... why doesn't url.read() look like a standard browser read? – AgentLiquid Mar 01 '09 at 21:24
  • Instead of going through the web interfaces, these APIs directly access the search XML. They connect to a different page at Google, which gives you data in a different format. Basically, you were getting 403 because you weren't allowed to access the data the way you were, and Google knew it (...) – Kevin Lacquement Mar 01 '09 at 21:28
  • (...) because your app either (a) didn't send a User-Agent string or (b) sent a default one that Google recognized as a robot (see http://google.com/robots.txt) – Kevin Lacquement Mar 01 '09 at 21:29
  • The problem with their APIs is that they don't return the same results as google.com. See http://code.google.com/p/google-ajax-apis/issues/detail?id=43 – Anders Rune Jensen May 22 '10 at 22:10
  • One thing I didn't like: they limit you to 64 results. – Don Kirkby Nov 08 '10 at 01:38
2

As lacqui suggested, the Google APIs are the way they want you to make requests from code. Unfortunately, I found their documentation was aimed at people writing AJAX web pages, not at people making raw HTTP requests. I used LiveHTTP Headers to trace the HTTP requests that the samples made, and I found ddipaolo's blog post helpful.
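For what it's worth, the raw request those AJAX samples reduce to was a plain GET against the (now long-deprecated) ajax.googleapis.com endpoint, which returned JSON. A sketch from memory, so treat the parameter names (v, q, start) and the response layout as assumptions rather than gospel:

```python
import json
import urllib.parse
import urllib.request

def ajax_search(query, start=0):
    # Historical Google AJAX Search API endpoint; it was shut down years
    # ago and is shown here only to illustrate the shape of the request.
    params = urllib.parse.urlencode({'v': '1.0', 'q': query, 'start': start})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?' + params
    with urllib.request.urlopen(url) as response:
        # Hits appeared under responseData['results'] in the JSON payload.
        return json.load(response)
```

The `start` parameter is how you paged through results, which is exactly where the 64-result cap described below bites.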

One more thing that tripped me up: they limit you to the first 64 results from a query. That's usually not a problem if you're just providing web users with a search box, but it's not helpful if you're trying to use Google for data mining. I guess they don't want you to data mine using their API. That 64-result limit has changed over time and varies between search products.

Update: It appears they definitely do not want you to go data mining. Eventually, you get a 403 error with a link to this API access notice.

Please review the Terms of Use for the API(s) you are using (linked in the right sidebar) and ensure compliance. It is likely that we blocked you for one of the following Terms of Use violations: We received automated requests, such as scraping and prefetching. Automated requests are prohibited; all requests must be made as a result of an end-user action.

They also list other violations, but I think that's the one that triggered for me. I may have to investigate Yahoo's BOSS service. It doesn't seem to have as many restrictions.

Don Kirkby
0

You're doing it too often. Google has limits in place to prevent being swamped by search bots. You can also try setting the User-Agent header to something that more closely resembles a normal browser.
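A minimal way to honor that limit is a throttle that spaces requests out, along the lines of the once-every-3-seconds pacing one commenter reported above. A sketch; the 3-second default is an empirical guess, not a documented quota:

```python
import time

class Throttle:
    """Ensure successive calls are at least `interval` seconds apart."""

    def __init__(self, interval=3.0):  # 3 s worked for one user; tune to taste
        self.interval = interval
        self._last = None

    def wait(self):
        # Sleep off whatever remains of the interval since the last call.
        if self._last is not None:
            remaining = self.interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Call `wait()` immediately before each urlopen: the first call returns at once, and later calls block just long enough to keep the pace.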

Joel Coehoorn