
I'm currently grabbing only the first page of Google results for a query, but I want to grab the first 5 pages.

gets a string like: https://encrypted.google.com/search?hl=en&q=site%3Asomedomain.com&start=0

The variable urls gets all 10 results for the first page. I started adding conditions to check that this first page has 10 URLs; if it does, I want to continue to the next page (provided that page has 10 results as well), using something like follow_link() and the URLs below:

https://encrypted.google.com/search?hl=en&q=site%3Asomedomain.com&start=10
https://encrypted.google.com/search?hl=en&q=site%3Asomedomain.com&start=20
https://encrypted.google.com/search?hl=en&q=site%3Asomedomain.com&start=30
https://encrypted.google.com/search?hl=en&q=site%3Asomedomain.com&start=40
https://encrypted.google.com/search?hl=en&q=site%3Asomedomain.com&start=50

How do I go about doing this? Could anyone please help me out?

CodeTalk
  • Why don't you use the link that you posted, `https://encrypted.google.com/search?hl=en&q=site%3Asomedomain.com&start=`, to get the result URLs? – 4d4c Aug 11 '13 at 08:14
  • Because the URLs with start=0, start=10, start=20, etc. aren't always available. The search is dynamic. – CodeTalk Aug 11 '13 at 12:42
  • Sometimes only start=0 or start=10 are used; other times all of start=0, start=10, start=20, start=30, start=40, and start=50 are used. – CodeTalk Aug 11 '13 at 13:25

1 Answer


You can use BeautifulSoup to locate the element that links to the next page:

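from mechanize import Browser
from bs4 import BeautifulSoup

br = Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt
# Adjacent string literals keep the line-continuation whitespace out of the header.
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.2; '
                  'WOW64) AppleWebKit/537.11 (KHTML, like Gecko) '
                  'Chrome/23.0.1271.97 Safari/537.11')]

url = "https://encrypted.google.com/search?hl=en&q=site%3Asomedomain.com&start=0"

r = br.open(url)

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(r, "html.parser")

# The link to the next results page is the <a> element with id="pnnext".
nextpage = soup.find("a", {"id": "pnnext"})
print(nextpage['href'])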

Output:

/search?q=site:somedomain.com&hl=en&ei=NJ4HUo2yM-TK4ATJlYGICQ&start=10&sa=N

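So now you have the link to the next page. Note that the href is relative, so it has to be joined with the site's base URL before you open it. If the element isn't found (find returns None), you are on the last page.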
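To cover the first 5 pages, the same lookup can be repeated in a loop that stops when there is no "pnnext" link. The following is only a rough sketch of that idea, not a tested scraper: `collect_pages`, `NextLinkParser`, and the `fetch` callable are hypothetical names introduced here, and the sketch uses the standard library's `HTMLParser` in place of BeautifulSoup so that it runs without third-party packages. It also assumes the next link still carries `id="pnnext"`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Records the href of the first <a id="pnnext"> element, if any."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("id") == "pnnext" and self.next_href is None:
            self.next_href = attrs.get("href")


def collect_pages(fetch, start_url, max_pages=5):
    """Return (url, html) pairs for up to max_pages pages, following "pnnext".

    fetch is a hypothetical callable: it takes a URL and returns the HTML as text.
    """
    pages = []
    url = start_url
    while url is not None and len(pages) < max_pages:
        html = fetch(url)
        pages.append((url, html))
        parser = NextLinkParser()
        parser.feed(html)
        # The href is relative, so resolve it against the current page URL.
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return pages
```

With the `br` object from the code above, `fetch` could be something like `lambda u: br.open(u).read().decode("utf-8", "replace")`, assuming the response body is UTF-8. Each returned HTML string can then be parsed for result URLs the same way as the first page.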

4d4c
  • This gave me a great idea: to just grab all of the numbered pages at the bottom at once. This is great. Thanks! – CodeTalk Aug 12 '13 at 15:52