0

I'm following the Automate the Boring Stuff tutorial, chapter 11, the "I'm Feeling Lucky" Google Search project. The HTML seems to download correctly, but when I use Beautiful Soup to select the result links I get nothing. The book says to use soup.select('.r a'), but that doesn't select anything.

After reading the documentation I tried a different syntax, soup.select('[class~=r]'), hoping Beautiful Soup would select something, but it didn't. I've also tried selecting other classes with no success, so I assume I'm doing something fundamentally wrong.

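import sys

import bs4
import requests

# Search terms come from the command line, e.g. `python lucky.py automate boring stuff`.
SEARCHVAR = sys.argv[1:]

res = requests.get('http://google.com/search?q=' + ' '.join(SEARCHVAR))
res.raise_for_status()
print('Searching ' + ' '.join(SEARCHVAR) + ' on Google')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
print('Parsing')
# Selector from the book; this is the call that returns an empty list.
linkElems = soup.select('.r a')
print(str(linkElems))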

I added print(str(linkElems)) to check what Beautiful Soup is selecting, but I keep getting an empty list: [].
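
As a side note on the URL in the snippet above: the search terms are joined with raw spaces and concatenated onto the query string. A minimal sketch of building the same URL with the standard library's urlencode is shown here. The helper name build_search_url is hypothetical, and the base URL is simply the one used in the question.

```python
from urllib.parse import urlencode


def build_search_url(terms):
    """Join the search terms and encode them as the `q` query parameter."""
    query = " ".join(terms)
    # urlencode escapes spaces and other reserved characters in the value.
    return "http://google.com/search?" + urlencode({"q": query})


url = build_search_url(["automate", "boring", "stuff"])
print(url)  # http://google.com/search?q=automate+boring+stuff
```

This only changes how the URL is assembled; it does not affect which classes appear in the HTML that comes back.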

yaho cho
Alan Tsui
  • I'm guessing the tutorial is out of date. I don't see that class in the requests response. I see randomly generated compound classes. Also, why the ' '.join(SEARCHVAR)? Seems a bit odd to have spaces between characters. – QHarr May 27 '19 at 19:03
  • @QHarr I am having the same problem, is this fixable? – CountDOOKU Nov 10 '19 at 03:56
  • This is the same problem: https://stackoverflow.com/questions/56664934/soup-select-r-a-in-fhttps-google-com-searchq-query-brings-back-empty – juloi Mar 03 '20 at 03:56

2 Answers

1

This doesn't work because of the way your GET request to Google is made. If I open Google in Chrome's developer tools, the div with class r does exist. However, when I download the same query with requests.get, it is no longer there. Instead, there is a div with class 'jfp3ef'. I was able to get the a tags associated with the search results with the following:

# Use a separate name so `soup` still refers to the parsed document.
result_divs = soup.find_all("div", {"class": "jfp3ef"})
for div in result_divs:
    print(div.select("a"))
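
To check which classes the downloaded HTML really contains (which can differ from what developer tools show), one option is to count the class names in the response text. This is only a sketch using the standard library's html.parser. ClassCounter and count_classes are hypothetical helper names, and the sample string stands in for res.text.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count every CSS class name that appears on a start tag."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # A class attribute may hold several space-separated names.
            if name == "class" and value:
                self.counts.update(value.split())


def count_classes(html):
    """Return a Counter mapping class name to number of occurrences."""
    parser = ClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Stand-in for the downloaded page; in practice pass res.text instead.
sample = (
    '<div class="jfp3ef"><a href="/url?q=x">x</a></div>'
    '<div class="jfp3ef other"></div>'
)
counts = count_classes(sample)
print(counts["r"], counts["jfp3ef"])  # 0 2
```

If counts["r"] is 0 for the real response, then soup.select('.r a') has nothing to match, and counts.most_common() gives a quick view of the classes that are there instead.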

If you want, you can download the entire page, including the divs with class r, by using urllib.request. Google blocks this by default, so you have to change the header information.

import sys
import urllib.parse
import urllib.request

import bs4

SEARCHVAR = sys.argv[1:]
# Encode the search terms so no raw spaces end up in the URL.
query = 'http://google.com/search?q=' + urllib.parse.quote_plus(' '.join(SEARCHVAR))
# Google blocks the default urllib User-Agent, so send a browser-like one.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 '
                  '(KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'
}
req = urllib.request.Request(query, headers=headers)
html = urllib.request.urlopen(req).read()
print('Searching ' + ' '.join(SEARCHVAR) + ' on Google')
soup = bs4.BeautifulSoup(html, 'html.parser')
print('Parsing')
linkElems = soup.select('.r a')
print(str(linkElems))

The example in the book is out of date. I assume the class "jfp3ef" in my top example is randomly generated by Google, so it will break soon or may not work for you at all. The bottom example does work well.

Geo
  • Thanks for the explanation, I was so confused about why the select wasn't working. But I'm still confused about the difference between what I see in developer tools and what requests.get returns. – Alan Tsui May 27 '19 at 21:34
  • Also, I used select('.jfp3ef a') and it started working, but I'm getting two of each result. – Alan Tsui May 27 '19 at 22:53
  • The bottom example still cannot retrieve all of them, only the top 2 results. – CountDOOKU Nov 10 '19 at 06:57
0

Replace this:

linkElems = soup.select('.r a')

With:

linkElems = soup.select('div#main > div > div > div > a')
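
To illustrate what that selector matches: it keeps only a tags that sit exactly three plain div levels below the div with id main. The sketch below runs it against a made-up HTML fragment. The markup is an assumption for demonstration only, not Google's actual page structure, which may change.

```python
import bs4

# Made-up fragment (an assumption): one link nested three divs deep
# under div#main, and one link nested only one div deep.
sample_html = (
    '<div id="main">'
    '<div><div><div><a href="/url?q=https://example.com/">Example</a></div></div></div>'
    '<div><a href="/skip">Skip</a></div>'
    '</div>'
)

soup = bs4.BeautifulSoup(sample_html, "html.parser")
# `>` is the child combinator: each step must be a direct child of the previous one.
link_elems = soup.select("div#main > div > div > div > a")
hrefs = [a.get("href") for a in link_elems]
print(hrefs)  # ['/url?q=https://example.com/']
```

Because the selector depends on nesting depth rather than on a class name, it is not affected by randomized class names, but it will stop matching if the nesting of the results page changes.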
Tom Carrick
  • Answers with more detailed explanations of the suggested changes tend to garner more upvotes, and are more likely to get accepted. Where appropriate, point out the differences between the current state and your proposed change. – Savage Henry Jun 03 '20 at 00:03