I am trying to scrape the answer from a Google quick answer box, but the element I want comes back empty, even though it shows a value when I inspect the page source. The code I have used is as follows:

import urllib
import urllib2
from bs4 import BeautifulSoup

# Python 2 code: send a browser-like User-Agent with the request
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
query = question
query = urllib.urlencode({'q': query})
url = "http://www.google.com/search?%s&"+query
page = opener.open(url)
soup = BeautifulSoup(page, "html.parser")
box = soup.find_all('div', class_='_XWk')
print(box)
Anu
  • Firstly, your URL template is broken: the URL params syntax is wrong. Secondly, I tried your code and found that the page variable is an object, not a string containing HTML, so it's absolutely useless to pass it to the BeautifulSoup constructor. – Andrew Che Mar 14 '17 at 18:02
  • Third, urllib(2) is not recommended; it's better to use requests instead. See this link for more info: http://stackoverflow.com/questions/2018026/what-are-the-differences-between-the-urllib-urllib2-and-requests-module – Andrew Che Mar 14 '17 at 18:05
  • And finally, I'd like to help you, but I am in Russia and I can't find a div with such a class on the search results page. Maybe it depends on the country? I could help if you saved a search results page from your browser and put it here as an HTML file. – Andrew Che Mar 14 '17 at 18:28
  • This is from the search results page in the browser. It appears only in the Inspect code section.
    Rabindranath Tagore
    – Anu Mar 15 '17 at 00:09
  • What is in the address bar when you do it in a browser? – Andrew Che Mar 15 '17 at 06:30
  • https://www.google.co.in/?gfe_rd=cr&ei=OmnJWO-iGK3T8gejyqrgAw&gws_rd=ssl#q=wrote+national+anthem+of+india&* – Anu Mar 15 '17 at 16:18
  • I found the same question at http://stackoverflow.com/questions/42808534/how-to-get-googles-fast-answer-box-text . But I am not sure how to get the URL that is mentioned in the answer programmatically, as I need to be able to extract the answer box for different queries. – Anu Mar 15 '17 at 16:52
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/138138/discussion-between---and-anu). – Andrew Che Mar 15 '17 at 17:00
  • Could you please save https://www.google.com/?gfe_rd=cr&ei=CJrJWJK9Ac7CtAGNj6rABA&gws_rd=cr&fg=1#q=definition:+calcium& and https://www.google.com/search?q=definition:+calcium&bav=on.2,or.r_cp.&cad=b&fp=1&biw=1920&bih=984&dpr=1&tch=1&ech=1&psi=1489578048971.3 from your browser and post them here? I managed to parse that data and get 10 search results, but it seems that the format varies from country to country. And if you try refreshing the search results page, you will notice that class names are automatically generated on each load (to prevent scraping), but there are some constant classes. – Andrew Che Mar 16 '17 at 12:48
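
The URL problem raised in the first comment can be sketched roughly as follows. This is a minimal Python 3 illustration, not the asker's code: `build_search_url` is a hypothetical helper name, and it relies only on the standard library's `urlencode`, which already returns the `q=...` pair, so it is appended directly after the `?` with no `%s` placeholder.

```python
from urllib.parse import urlencode

def build_search_url(question):
    """Return a search URL with the question percent-encoded as the q parameter."""
    # urlencode({'q': ...}) yields 'q=...' with spaces encoded as '+'
    return "https://www.google.com/search?" + urlencode({"q": question})

# The query text is taken from the address-bar URL quoted in the comments
url = build_search_url("wrote national anthem of india")
print(url)
```

Because the helper takes the question as an argument, the same function can be reused for different queries.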

1 Answer

Since you're trying to scrape just one element from the whole HTML (if that is the case), there's no need to use the find_all()/findAll() methods.

Instead, you can use the find() or select_one() methods that bs4 provides: find() grabs the first element matching a tag and its attributes, and select_one() grabs the first element matching a CSS selector. You can use SelectorGadget to find CSS selectors.
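
As a rough illustration of the difference, here is a minimal sketch run against a hand-written HTML snippet rather than a live Google page. The `answer` class and `first` id are made up for the example and are not Google's real names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a results page (names are invented)
html = """
<div class="answer" id="first">Rabindranath Tagore</div>
<div class="answer">Second match</div>
"""

soup = BeautifulSoup(html, "html.parser")

all_matches = soup.find_all("div", class_="answer")  # list-like ResultSet of every match
first_match = soup.find("div", class_="answer")      # first matching Tag, or None
css_match = soup.select_one("div.answer#first")      # first match for a CSS selector, or None
missing = soup.select_one("#does-not-exist")         # None, because nothing matches

match_count = len(all_matches)
first_text = first_match.text
css_text = css_match.text
print(match_count, first_text, css_text, missing)
```

Note that find() and select_one() return None when nothing matches, so calling .text on the result of a selector that is absent from the downloaded HTML raises an AttributeError.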


For example, say you want to scrape weather data from a Google Search answer box. You can do this in two ways:

  1. Using a custom script. I scraped a bit more than one element just to show that it's a straightforward process.
  2. Using Google Direct Answer Box API from SerpApi. It's a paid API with a free trial of 5,000 searches. Check out the playground to test it out.

Code and full example in the online IDE (works on other weather searches as well):

from bs4 import BeautifulSoup
import requests, lxml  # lxml is the parser passed to BeautifulSoup below

# Browser-like User-Agent header sent with the request
headers = {
  "User-Agent":
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Passing the query via params lets requests encode the space in it
response = requests.get('https://www.google.com/search', params={'q': 'london weather'}, headers=headers).text
soup = BeautifulSoup(response, 'lxml')

# Each weather field is selected by its own id (#wob_*)
weather_condition = soup.select_one('#wob_dc').text
temperature = soup.select_one('#wob_tm').text
precipitation = soup.select_one('#wob_pp').text
humidity = soup.select_one('#wob_hm').text
wind = soup.select_one('#wob_ws').text
current_time = soup.select_one('#wob_dts').text

print(f'Weather condition: {weather_condition}\nTemperature: {temperature}°F\nPrecipitation: {precipitation}\nHumidity: {humidity}\nWind speed: {wind}\nCurrent time: {current_time}')

# output:
'''
Weather condition: Mostly cloudy
Temperature: 47°F
Precipitation: 79%
Humidity: 49%
Wind speed: 9 mph
Current time: Thursday 10:00 AM
'''
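
The script above assumes every `#wob_*` element is present in the response. If one is absent, select_one() returns None and `.text` fails. A minimal defensive sketch follows; `text_or_none` is a hypothetical helper (not part of bs4), and the HTML is a hand-written stand-in rather than a real Google response.

```python
from bs4 import BeautifulSoup

def text_or_none(soup, selector):
    """Return the stripped text of the first element matching selector, or None."""
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element is not None else None

# Hypothetical stand-in for a response in which only part of the box is present
html = '<span id="wob_tm">47</span><span id="wob_dc"> Mostly cloudy </span>'
soup = BeautifulSoup(html, "html.parser")

temperature = text_or_none(soup, "#wob_tm")
condition = text_or_none(soup, "#wob_dc")
humidity = text_or_none(soup, "#wob_hm")  # this id is not in the sample markup

print(temperature, condition, humidity)
```

Returning None instead of raising makes it easier to see which selectors matched when the markup differs between requests or countries.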

Basically, the main difference is that with the Google Direct Answer Box API everything is already done for the end user and returned as JSON, so you don't need to figure things out, tinker with HTML elements to get the desired output, or guess why the output differs from what you expected.

Code to get the weather answer box with SerpApi:

from serpapi import GoogleSearch
import os

params = {
  "engine": "google",
  "q": "london weather",
  "api_key": os.getenv("API_KEY"),
  "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

loc = results['answer_box']['location']
weather_date = results['answer_box']['date']
weather = results['answer_box']['weather']
temp = results['answer_box']['temperature']
unit = results['answer_box']['unit']
precipitation = results['answer_box']['precipitation']
humidity = results['answer_box']['humidity']
wind = results['answer_box']['wind']

forecast = results['answer_box']['forecast']

print(f'{loc}\n{weather_date}\n{weather}\n{temp}\n{unit}\n{precipitation}\n{humidity}\n{wind}\n\n{forecast}')

# output:
'''
London, UK
Thursday 7:00 AM
Mostly sunny
53
Fahrenheit
2%
89%
1 mph

[{'day': 'Thursday', 'weather': 'Mostly cloudy', 'temperature': {'high': '70', 'low': '53'}},
...
'''
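
Not every query produces an answer box, so indexing `results['answer_box']` directly can raise a KeyError. The following is a minimal sketch of more tolerant access using plain dictionary methods; `extract_weather` is a hypothetical helper, and the sample dict is hand-written to resemble the output above, not a real API response.

```python
def extract_weather(results):
    """Pull a few weather fields out of a result dict, tolerating missing keys."""
    answer_box = results.get("answer_box") or {}
    fields = ("location", "date", "weather", "temperature", "unit")
    return {field: answer_box.get(field) for field in fields}

# Hand-written sample shaped like the output above (assumed structure)
sample = {"answer_box": {"location": "London, UK", "temperature": "53", "unit": "Fahrenheit"}}

with_box = extract_weather(sample)
without_box = extract_weather({})

print(with_box)
print(without_box)
```

Fields that are absent come back as None, which the caller can check before printing or storing them.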

Disclaimer: I work for SerpApi.

Dmitriy Zub