
Is there an easy way to scrape Google and write the text (just the text) of the top N (say, 1000) .html (or whatever) documents for a given search?

As an example, imagine searching for the phrase "big bad wolf" and downloading just the text from the top 1000 hits -- i.e., actually downloading the text from those 1000 web pages (but just those pages, not the entire site).

I'm assuming this would use the urllib2 library (or its Python 3 counterpart, urllib.request)? I'm using Python 3.1, if that helps.

Georgina

3 Answers


Check out BeautifulSoup for scraping the content out of web pages. It is supposed to be very tolerant of broken web pages, which will help because not all results are well formed. So you should be able to:

  • Request http://www.google.ca/search?q=QUERY_HERE
  • Extract and follow the result links using BeautifulSoup (result links appear to carry class="r")
  • Extract the text from each result page using BeautifulSoup (see the sketch after this list)
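
A minimal sketch of those three steps, assuming Python 3 with the bs4 package installed; the class="r" selector reflects Google's markup at the time of writing and may well have changed since:

```python
import urllib.request
from urllib.parse import quote_plus
from bs4 import BeautifulSoup

def fetch(url):
    # Google rejects urllib's default User-Agent, so supply a browser-like one.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    return urllib.request.urlopen(req).read()

# Step 1: request the results page.
html = fetch("http://www.google.ca/search?q=" + quote_plus("big bad wolf"))

# Step 2: extract the result links. Google wraps them in class="r" elements;
# note the hrefs may be "/url?q=..." redirects that need unwrapping.
soup = BeautifulSoup(html, "html.parser")
links = [a["href"]
         for r in soup.find_all(class_="r")
         for a in r.find_all("a", href=True)]

# Step 3: download each result page and keep just its visible text.
for url in links:
    page = BeautifulSoup(fetch(url), "html.parser")
    print(page.get_text(separator=" ", strip=True))
```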
Cody

The official way to get results from Google programmatically is to use Google's Custom Search API. As icktoofay comments, other approaches (such as directly scraping the results or using the xgoogle module) break Google's terms of service. Because of that, you might want to consider using the API from another search engine, such as the Bing API or Yahoo!'s service.
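
For example, here is a minimal sketch of calling the Custom Search JSON API with only the standard library; the key and engine ID values below are hypothetical placeholders you would obtain from the Google developer console:

```python
import json
import urllib.request
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"    # hypothetical placeholder
ENGINE_ID = "YOUR_CX_ID"    # hypothetical placeholder (the "cx" value)

params = urlencode({"key": API_KEY, "cx": ENGINE_ID, "q": "big bad wolf"})
url = "https://www.googleapis.com/customsearch/v1?" + params
data = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))

# Each item in the response carries the result URL in its "link" field.
for item in data.get("items", []):
    print(item["link"])
```

Note that the API pages its results (roughly ten per request), so collecting a large number of hits means repeating the call with a start offset.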

Mark Longair
  • Thanks Mark. Would it work with the API from one of these other search engines? – Georgina Mar 16 '11 at 05:58
  • @Georgina: I haven't done this myself, but it should do - for example, if you Google `bing api python example` the top couple of hits are Python modules to help get search results from that service. You'll still need to use `urllib2` to download the pages at the URLs you find, of course. – Mark Longair Mar 16 '11 at 06:17
  • They all have the same TOS; no good search engine gives this away for free. – WeaselFox Feb 05 '12 at 12:40

As mentioned, scraping Google violates their TOS, though that's probably not the answer you're looking for.

There's a PHP script available that does a good job of scraping Google: http://google-scraper.squabbel.com/ Give it a keyword and the number of results you want, and it will return all the results for you. Then parse out the URLs it returns and use urllib or curl to fetch the HTML source of each page, and you're done.
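
A minimal sketch of that last step using only the standard library; the urls list below is a hypothetical stand-in for whatever the scraper script returns:

```python
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects a page's visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

urls = ["http://example.com/a", "http://example.com/b"]  # hypothetical

for url in urls:
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req).read().decode("utf-8", errors="replace")
    extractor = TextExtractor()
    extractor.feed(html)
    print(" ".join(extractor.chunks))
```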

You also really shouldn't attempt to scrape Google unless you have more than 100 proxy servers, though. They'll easily ban your IP, at least temporarily, after just a few automated requests.

Henley