I wrote some Python code using requests
to try to build a database of search result links:
    from bs4 import BeautifulSoup
    import requests
    import re

    clean_links = []
    for start in range(0, 1000, 20):
        # Page through results 20 at a time using the start/num parameters
        url = ("https://www.google.com/search?q=inurl%3Agedt.html"
               "&ie=utf-8&num=20&start=" + str(start))
        page = requests.get(url)
        soup = BeautifulSoup(page.content, "html.parser")
        # Google wraps result links as /url?q=<target>, so strip that prefix
        for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
            cleaned = re.split(r":(?=http)", link["href"].replace("/url?q=", ""))
            print(cleaned)
            clean_links.append(cleaned)
However, after only 40 results Google suspected me of being a robot and quit providing results. That's their prerogative, but is there a (legitimate) way of getting around this?
Can I add some sort of authentication in requests/bs4, and if so, is there some kind of account that lets me pay them for the privilege of scraping all 10,000-20,000 results?
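
The closest official-looking option I've found is Google's Custom Search JSON API, which authenticates with an API key instead of scraping the result HTML. A rough sketch of what I have in mind, assuming that API is the right route (the API key and search-engine ID cx below are placeholders, and I'm not sure its quotas would stretch to 10,000-20,000 results):

    import requests

    # Placeholders: a real API key and Programmable Search Engine ID (cx) are needed
    API_KEY = "YOUR_API_KEY"
    CX = "YOUR_SEARCH_ENGINE_ID"

    def search(query, start=1):
        # The Custom Search JSON API returns at most 10 results per request;
        # 'start' is the 1-based index of the first result to return
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": API_KEY, "cx": CX, "q": query, "start": start, "num": 10},
        )
        resp.raise_for_status()
        return [item["link"] for item in resp.json().get("items", [])]

    links = []
    for start in range(1, 100, 10):
        links.extend(search("inurl:gedt.html", start=start))

Is that (or some paid tier of it) the intended way to get results at this volume, or is there a better option?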