beautifulsoup select method not selecting results as expected

Question

I'm following Automate the Boring Stuff, chapter 11, the "I'm Feeling Lucky" Google Search project. The HTML seems to download correctly, but when I use BeautifulSoup to select the result links I get nothing. The book says to use soup.select('.r a'), and that selects nothing.

After reading the documentation I tried a different syntax, soup.select('[class~=r]'), hoping BeautifulSoup would select something, but it didn't. I've also tried selecting other classes without success, so I assume I'm doing something fundamentally wrong.

import sys

import bs4
import requests

# Search terms come from the command line as a list of words
SEARCHVAR = sys.argv[1:]

res = requests.get('http://google.com/search?q=' + ' '.join(SEARCHVAR))
res.raise_for_status()
print('Searching ' + ' '.join(SEARCHVAR) + ' on Google')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
print('Parsing')
linkElems = soup.select('.r a')
print(str(linkElems))

I added print(str(linkElems)) to check what BeautifulSoup is selecting, but I keep getting an empty list, [].

I'm guessing the tutorial is out of date. I don't see that class in the requests response. I see randomly generated compound classes. Also, why the ' '.join(SEARCHVAR) ? Seems a bit odd to have spaces between characters. — QHarr, May 27 '19 at 19:03
This is the same problem: https://stackoverflow.com/questions/56664934/soup-select-r-a-in-fhttps-google-com-searchq-query-brings-back-empty — juloi, Mar 03 '20 at 03:56
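
On the ' '.join(SEARCHVAR) point raised in the comments: sys.argv[1:] is a list of whole words, so the join puts spaces between words, not between characters. The sketch below shows one way to build the search URL with the standard library's urllib.parse.urlencode, which also escapes spaces and other special characters. build_search_url is a hypothetical helper name introduced for illustration, and the base URL is simply the one used in the question.

```python
from urllib.parse import urlencode

BASE_URL = 'http://google.com/search'  # base URL taken from the question


def build_search_url(words):
    """Hypothetical helper: join command-line words and URL-encode the query."""
    query = ' '.join(words)  # ['python', 'tutorial'] -> 'python tutorial'
    return BASE_URL + '?' + urlencode({'q': query})


print(build_search_url(['python', 'tutorial']))
print(build_search_url(['c++', '&', 'rust']))
```

Encoding the query this way keeps characters such as + and & from being misread as part of the URL syntax.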

Geo · Answer 1 · 2019-05-27T20:49:07.777

This doesn't work because of what Google returns for your GET request. If I inspect Google with the developer tools in Chrome, the div with class r does exist. However, when I download the same query with requests.get, that class is no longer there; instead there is a div class called 'jfp3ef'. I was able to get the a tags associated with the search results with the following:

# Keep the parsed document in soup; store the matching divs under a separate name
result_divs = soup.find_all("div", {"class": "jfp3ef"})
for div in result_divs:
    print(div.select("a"))
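
Since the class names in the downloaded HTML differ from what the browser shows, it can help to list the classes that are actually present before writing a selector. The following is a minimal sketch run against a small inline HTML sample rather than a live Google response; the sample markup and the helper name class_counts are assumptions made for illustration.

```python
from collections import Counter

import bs4

# Stand-in for res.text; real Google markup will look different
sample_html = '''
<div id="main">
  <div class="jfp3ef"><a href="https://example.com/a">A</a></div>
  <div class="jfp3ef"><a href="https://example.com/b">B</a></div>
  <div class="footer"><a href="/preferences">Settings</a></div>
</div>
'''


def class_counts(html):
    """Hypothetical helper: count how often each CSS class appears in the HTML."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    counts = Counter()
    for tag in soup.find_all(class_=True):  # only tags that have a class attribute
        counts.update(tag.get('class', []))
    return counts


counts = class_counts(sample_html)
print(counts.most_common())
print('r' in counts)  # is the class the book relies on present at all?
```

Running the same function on res.text from the question would show whether a class named r exists in what requests downloaded.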

If you want, you can download the entire page, including the divs with class r, by using urllib.request. Google blocks the default client, so you have to change the header information (the User-Agent):

import sys
import urllib.parse
import urllib.request

import bs4

SEARCHVAR = sys.argv[1:]
# URL-encode the query; a raw space is not valid inside a URL
query = 'http://google.com/search?' + urllib.parse.urlencode({'q': ' '.join(SEARCHVAR)})
# Send a browser-like User-Agent instead of the default Python one
headers = {}
headers['User-Agent'] = ("Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 "
                         "(KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17")
req = urllib.request.Request(query, headers=headers)
html = urllib.request.urlopen(req).read()
print('Searching ' + ' '.join(SEARCHVAR) + ' on Google')
soup = bs4.BeautifulSoup(html, 'html.parser')
print('Parsing')
linkElems = soup.select('.r a')
print(str(linkElems))

The example in the book is out of date. I assume the class "jfp3ef" in my top example is randomly generated by Google, so it will probably break soon or may not work for you at all. The bottom example does work well.
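
The question uses requests rather than urllib, and the same header change can be made there through the headers argument of requests.get. This is a sketch only: fetch_results_page is a hypothetical helper, and whether the returned markup contains the class r depends on what Google serves for that User-Agent. The last lines build the request without sending it, so the outgoing URL and header can be inspected offline.

```python
import requests

# Browser-like User-Agent copied from the urllib example above
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 '
                   '(KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17')
}


def fetch_results_page(words):
    """Hypothetical helper: download the search page with a custom User-Agent."""
    res = requests.get('http://google.com/search',
                       params={'q': ' '.join(words)},  # requests URL-encodes this
                       headers=HEADERS)
    res.raise_for_status()
    return res.text


# Prepare (but do not send) the same request to see what would go out
prepared = requests.Request('GET', 'http://google.com/search',
                            params={'q': 'python tutorial'},
                            headers=HEADERS).prepare()
print(prepared.url)
print(prepared.headers['User-Agent'])
```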

Thanks for the explanation; I was so confused about why select wasn't working. But I'm still confused about the difference between what I see in developer tools and what requests.get returns. — Alan Tsui, May 27 '19 at 21:34
Also, I used select('.jfp3ef a') and it started working, but I'm getting two of each result. — Alan Tsui, May 27 '19 at 22:53
The bottom example still cannot retrieve all the results, only the top 2. — CountDOOKU, Nov 10 '19 at 06:57
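
On the duplicate results mentioned above: if each result contains more than one a tag pointing at the same URL, one option is to collect the href values and drop repeats while keeping their order. The sketch below assumes that is the cause and uses an inline sample instead of real Google markup; unique_hrefs is a hypothetical helper name.

```python
import bs4

# Assumed shape: each result has two links with the same href
sample_html = '''
<div class="jfp3ef"><a href="https://example.com/a">Title A</a></div>
<div class="jfp3ef"><a href="https://example.com/a">example.com</a></div>
<div class="jfp3ef"><a href="https://example.com/b">Title B</a></div>
<div class="jfp3ef"><a href="https://example.com/b">example.com</a></div>
'''


def unique_hrefs(html, selector):
    """Hypothetical helper: hrefs matched by selector, duplicates removed, order kept."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    hrefs = [a['href'] for a in soup.select(selector) if a.has_attr('href')]
    return list(dict.fromkeys(hrefs))  # dict keys keep first-seen order


links = unique_hrefs(sample_html, '.jfp3ef a')
print(links)
```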

score 0 · Answer 2 · edited Jun 02 '20 at 14:57

Replace this:

linkElems = soup.select('.r a')

With:

linkElems = soup.select('div#main > div > div > div > a')

answered Jun 02 '20 at 14:21 by Leonardo Garcia Tampelini · edited Jun 02 '20 at 14:57 by Tom Carrick

Answers with more detailed explanations of the suggested changes tend to garner more upvotes, and are more likely to get accepted. Where appropriate, point out the differences between the current state and your proposed change. – Savage Henry Jun 03 '20 at 00:03
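
To illustrate what the selector in this answer does: 'div#main > div > div > div > a' relies on page structure instead of class names, and the > combinator matches direct children only. A minimal sketch on an inline sample follows; the sample markup is an assumption and not Google's actual page.

```python
import bs4

sample_html = '''
<div id="main">
  <div><div><div>
    <a href="https://example.com/result">Result</a>
  </div></div></div>
  <div><a href="https://example.com/shallow">Too shallow</a></div>
  <div><div><div><span>
    <a href="https://example.com/deep">Wrapped in a span</a>
  </span></div></div></div>
</div>
'''

soup = bs4.BeautifulSoup(sample_html, 'html.parser')
# Only an <a> exactly three divs below div#main matches
matched = [a['href'] for a in soup.select('div#main > div > div > div > a')]
print(matched)
```

A structural selector like this avoids the generated class names, but it stops matching as soon as the nesting depth of the page changes.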
