It's not rendered via JavaScript as pawelbylina mentioned, and you don't have to use requests-html or selenium since everything needed is in the HTML. Rendering the page would also slow down the scraping process a lot.
It could be because no user-agent is specified, so Google blocks your request and you receive a different HTML with some sort of error. The default requests user-agent is python-requests, so Google recognizes that it's not a "real" user visit and blocks the request. Check what your user-agent is.
Pass a user-agent in the request headers:
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
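To confirm which user-agent would actually go out, here is a minimal sketch that builds the request without sending it, so no network call is made. The variable names are illustrative, and the URL is the same Google search endpoint used further down.

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# What requests sends when no user-agent is given (starts with "python-requests/").
default_user_agent = requests.utils.default_user_agent()

# Build the request without sending it, to inspect the outgoing headers.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request("GET", "https://www.google.com/search", headers=headers)
)
sent_user_agent = prepared.headers["User-Agent"]

print(f"Default: {default_user_agent}")
print(f"Sent:    {sent_user_agent}")
```

If the second line still shows python-requests, the headers dict is probably not being passed to the call.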
You're looking for this. Use select_one() to grab just one element:
soup.select_one('#wob_dc').text
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired elements in your browser.
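One thing to watch with select_one(): it returns None when nothing matches (for example, when Google served a block page instead of the weather box), so calling .text on the result raises AttributeError. Below is a small sketch of a guard for that. The text_or_default helper and the sample HTML are hypothetical, written only for illustration, and html.parser is used here just to keep the snippet free of the lxml dependency.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, selector, default=None):
    """Return the stripped text of the first match, or `default` if none."""
    element = soup.select_one(selector)
    if element is None:
        return default
    return element.get_text(strip=True)


# Hand-written sample standing in for a real response body.
sample_html = '<div><span id="wob_dc">Cloudy</span></div>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

condition = text_or_default(sample_soup, "#wob_dc")
missing = text_or_default(sample_soup, "#wob_tm", default="n/a")

print(condition)
print(missing)
```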
Code and a full example that scrapes more in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "일본 桜川市真壁町古城 내일 날씨",
"hl": "en",
}
response = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(response.text, 'lxml')
location = soup.select_one('#wob_loc').text
weather_condition = soup.select_one('#wob_dc').text
temperature = soup.select_one('#wob_tm').text
precipitation = soup.select_one('#wob_pp').text
humidity = soup.select_one('#wob_hm').text
wind = soup.select_one('#wob_ws').text
current_time = soup.select_one('#wob_dts').text
print(f'Location: {location}\n'
f'Weather condition: {weather_condition}\n'
f'Temperature: {temperature}°F\n'
f'Precipitation: {precipitation}\n'
f'Humidity: {humidity}\n'
f'Wind speed: {wind}\n'
f'Current time: {current_time}\n')
------
'''
Location: Makabecho Furushiro, Sakuragawa, Ibaraki, Japan
Weather condition: Cloudy
Temperature: 79°F
Precipitation: 40%
Humidity: 81%
Wind speed: 7 mph
Current time: Saturday
'''
Alternatively, you can achieve the same thing by using the Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to think about how to bypass blocks from Google or figure out why data from certain elements isn't being extracted as it should, since that's already done for the end user. The only thing left to do is to iterate over the structured JSON and grab the data you want.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
"engine": "google",
"q": "일본 桜川市真壁町古城 내일 날씨",
"api_key": os.getenv("API_KEY"),
"hl": "en",
}
search = GoogleSearch(params)
results = search.get_dict()
loc = results['answer_box']['location']
weather_date = results['answer_box']['date']
weather = results['answer_box']['weather']
temp = results['answer_box']['temperature']
precipitation = results['answer_box']['precipitation']
humidity = results['answer_box']['humidity']
wind = results['answer_box']['wind']
print(f'{loc}\n{weather_date}\n{weather}\n{temp}°F\n{precipitation}\n{humidity}\n{wind}\n')
--------
'''
Makabecho Furushiro, Sakuragawa, Ibaraki, Japan
Saturday
Cloudy
79°F
40%
81%
7 mph
'''
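If the query doesn't trigger a weather answer box, results['answer_box'] raises KeyError. Here is a minimal sketch of a more forgiving lookup using only the keys shown above. The extract_weather helper is hypothetical, and sample_results is a hand-written dict mirroring the printed output, not a real API response.

```python
WEATHER_KEYS = (
    "location", "date", "weather", "temperature",
    "precipitation", "humidity", "wind",
)


def extract_weather(results):
    """Return the weather fields, with None for any that are missing."""
    answer_box = results.get("answer_box") or {}
    return {key: answer_box.get(key) for key in WEATHER_KEYS}


# Hand-written sample mirroring the output above (an assumption, not real data).
sample_results = {
    "answer_box": {
        "location": "Makabecho Furushiro, Sakuragawa, Ibaraki, Japan",
        "date": "Saturday",
        "weather": "Cloudy",
        "temperature": "79",
        "precipitation": "40%",
        "humidity": "81%",
        "wind": "7 mph",
    }
}

weather = extract_weather(sample_results)
empty = extract_weather({})

print(weather["weather"])
print(empty["location"])
```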
Disclaimer: I work for SerpApi.