I wrote some Python code using requests
to try to build a database of search result links:
    from bs4 import BeautifulSoup
    import requests
    import re

    clean_links = []
    for start in range(0, 1000, 20):
        # Page through results 20 at a time using the start/num parameters
        url = ("https://www.google.com/search?q=inurl%3Agedt.html"
               "&ie=utf-8&num=20&start=" + str(start))
        page = requests.get(url)
        soup = BeautifulSoup(page.content, "html.parser")
        # Google wraps result links as /url?q=<target>, so strip that prefix
        for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
            cleaned = re.split(r":(?=http)", link["href"].replace("/url?q=", ""))
            print(cleaned)
            clean_links.append(cleaned)
However, after only 40 results Google suspected me of being a robot and quit providing results. That's their prerogative, but is there a (legitimate) way of getting around this?
Can I add some sort of authentication in requests/bs4, and if so, is there some kind of account that lets me pay them for the privilege of scraping all 10,000-20,000 results?
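
The closest official-looking option I've found is Google's Custom Search JSON API, which authenticates with an API key instead of scraping the result HTML. A rough sketch of what I have in mind, assuming that API is the right route (the API key and search-engine ID cx below are placeholders, and I'm not sure its quotas would stretch to 10,000-20,000 results):

    import requests

    # Placeholders: a real API key and Programmable Search Engine ID (cx) are needed
    API_KEY = "YOUR_API_KEY"
    CX = "YOUR_SEARCH_ENGINE_ID"

    def search(query, start=1):
        # The Custom Search JSON API returns at most 10 results per request;
        # 'start' is the 1-based index of the first result to return
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": API_KEY, "cx": CX, "q": query, "start": start, "num": 10},
        )
        resp.raise_for_status()
        return [item["link"] for item in resp.json().get("items", [])]

    links = []
    for start in range(1, 100, 10):
        links.extend(search("inurl:gedt.html", start=start))

Is that (or some paid tier of it) the intended way to get results at this volume, or is there a better option?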