
When I try to download a Google search result page using HttpWebRequest in C#, everything works well with simple search terms, like

http://www.google.com/search?q=stackoverflow

But when I try to make it more complex, for example

http://www.google.com/search?q=inurl%3A%22goethe%22%20filetype%3Apdf

which means

inurl:"goethe" filetype:pdf

I receive a 503 error because Google thinks I'm a bot. Is there any workaround?

Edit: UserAgent is set to "Mozilla/5.0".
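For reference, the request is built roughly like this (a minimal sketch, assuming the query is URL-encoded with Uri.EscapeDataString):

    using System;
    using System.IO;
    using System.Net;

    class GoogleDownload
    {
        static void Main()
        {
            // EscapeDataString produces the inurl%3A%22goethe%22%20filetype%3Apdf
            // encoding shown above.
            string url = "http://www.google.com/search?q=" +
                         Uri.EscapeDataString("inurl:\"goethe\" filetype:pdf");

            var request = (HttpWebRequest)WebRequest.Create(url);
            request.UserAgent = "Mozilla/5.0";

            // GetResponse() throws a WebException with status 503 for the complex query.
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                Console.WriteLine(reader.ReadToEnd().Length);
            }
        }
    }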

3 Answers


Well... if your search is done programmatically, then Google just so happens to be right: you ARE a bot :-)

cheers!


I don't believe it has much to do with how complex your query is. The only thing that really matters is whether they think you're a bot. If you submit queries at a very high rate, Google will think you're a bot, so there are several possible solutions (a sketch follows the list):

  1. Reduce the rate at which you're sending queries.
  2. Use multiple proxies so that your queries come from different IP addresses.
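
A minimal sketch of both ideas, assuming a hypothetical Download helper and placeholder proxy addresses:

    using System;
    using System.IO;
    using System.Net;
    using System.Threading;

    class ThrottledScraper
    {
        // Placeholder proxy addresses; substitute your own.
        static readonly string[] Proxies = { "http://proxy1:8080", "http://proxy2:8080" };

        static string Download(string url, int attempt)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.UserAgent = "Mozilla/5.0";
            // Rotate through the proxies so queries come from different IPs.
            request.Proxy = new WebProxy(Proxies[attempt % Proxies.Length]);
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
                return reader.ReadToEnd();
        }

        static void Main()
        {
            string[] queries = { "stackoverflow", "inurl:\"goethe\" filetype:pdf" };
            for (int i = 0; i < queries.Length; i++)
            {
                string url = "http://www.google.com/search?q=" + Uri.EscapeDataString(queries[i]);
                string html = Download(url, i);
                Console.WriteLine("{0}: {1} bytes", queries[i], html.Length);
                Thread.Sleep(TimeSpan.FromSeconds(30)); // throttle between queries
            }
        }
    }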

Additionally, it's important to note that making web requests without saving cookies can be another signal for Google that you're a bot. You should also be very careful not to get your proxies blocked by Google, because you're scraping the big G. Free proxies are hard to find, and if you abuse them they'll get shut down, so be a good citizen!
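
On the cookie point: HttpWebRequest only keeps cookies if you give it a CookieContainer. A minimal sketch of reusing one container across requests:

    using System.Net;

    class CookieAwareClient
    {
        // One container shared by all requests, so Google's cookies persist
        // across queries the way they would in a browser session.
        static readonly CookieContainer Cookies = new CookieContainer();

        public static HttpWebRequest CreateRequest(string url)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.UserAgent = "Mozilla/5.0";
            request.CookieContainer = Cookies; // without this, cookies are silently dropped
            return request;
        }
    }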

Good luck!

    The funny thing is, I can do simple queries like "stackoverflow" a hundred times in a row, but as soon as I use the "intitle" parameter, Google sends me a 503. "filetype" works though, I was able to receive 900 result pages of PDF files in less than a minute. –  Mar 23 '12 at 17:24
  • @dm Google doesn't look at a single thing when they detect if you're using a "bot" to query for stuff, they look at multiple factors and you might have stumbled upon a combination that they deem is a clear indicator for robot activity. It's *deliberately* difficult to tell what might trigger Google's blocking mechanisms. – Kiril Mar 23 '12 at 18:02

Try Google Custom Search APIs and Tools. They will let you retrieve search results without fear of being denied access (up to a quota).
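A minimal sketch against the Custom Search JSON API; the API key and engine ID below are placeholders you would get from the Google API console:

    using System;
    using System.IO;
    using System.Net;

    class CustomSearchExample
    {
        static void Main()
        {
            const string apiKey   = "YOUR_API_KEY";    // placeholder
            const string engineId = "YOUR_ENGINE_ID";  // placeholder (the cx parameter)

            string url = "https://www.googleapis.com/customsearch/v1" +
                         "?key=" + apiKey +
                         "&cx=" + engineId +
                         "&q=" + Uri.EscapeDataString("inurl:\"goethe\" filetype:pdf");

            var request = (HttpWebRequest)WebRequest.Create(url);
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
                Console.WriteLine(reader.ReadToEnd()); // JSON document with the results
        }
    }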

Alternatively, mimic all the nuances of a typical search query. For example, in my browser, searching for inurl:"goethe" filetype:pdf results in this URL being requested.
There are also cookies and other HTTP headers to get right; make the request look a lot more like a browser sent it.
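
Setting the common browser headers on an HttpWebRequest might look like this (the header values are just examples copied from a typical browser):

    using System.Net;

    class BrowserLikeRequest
    {
        public static HttpWebRequest Create(string url, CookieContainer cookies)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);

            // A realistic browser identity rather than a bare "Mozilla/5.0".
            request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0";
            request.Accept    = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
            request.Referer   = "http://www.google.com/";

            // Accept-Language is not a restricted header, so it can be set directly.
            request.Headers["Accept-Language"] = "en-US,en;q=0.8";

            request.CookieContainer = cookies; // keep the session's cookies
            return request;
        }
    }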
