It's because you haven't specified a user-agent in the HTTP request headers. Learn more about the user-agent and request headers.
Basically, the user-agent string identifies the browser, its version number, and its host operating system, representing the person (browser) in a Web context. It lets servers and network peers tell whether a request comes from a real browser or a bot. Check what your user-agent is.
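By default, the requests library announces itself in a way that makes the client obvious, which is likely why Google serves you a different page. A quick way to see the default (nothing assumed here beyond the requests library itself):

```python
import requests

# requests sends "python-requests/<version>" as its User-Agent unless
# overridden, which many sites use to detect non-browser traffic
default_headers = requests.utils.default_headers()
print(default_headers["User-Agent"])  # e.g. python-requests/2.31.0
```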
Pass the user-agent into the request headers:
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
Use select_one() instead. CSS selectors are more readable and a bit faster. See the CSS selectors reference.
soup.select_one('#result-stats nobr').previous_sibling
# About 107,000 results
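To illustrate why the CSS selector plus .previous_sibling works, here's the same traversal on a minimal HTML snippet modeled after Google's result-stats markup (the snippet itself is made up for the example):

```python
from bs4 import BeautifulSoup

# minimal stand-in for Google's result-stats markup (made up for the example)
html = '<div id="result-stats">About 107,000 results<nobr> (0.38 seconds)</nobr></div>'
soup = BeautifulSoup(html, "html.parser")

# equivalent find() chain, for comparison:
# soup.find("div", id="result-stats").find("nobr")
stats = soup.select_one("#result-stats nobr")

# .previous_sibling steps back to the text node before <nobr>,
# dropping the unwanted "(0.38 seconds)" part
print(stats.previous_sibling)  # About 107,000 results
```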
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "fus ro dah definition", # query
"gl": "us", # country
"hl": "en" # language
}
response = requests.get('https://www.google.com/search',
headers=headers,
params=params)
soup = BeautifulSoup(response.text, 'lxml')
# .previous_sibling steps back to the previous sibling node, removing the unwanted part: "(0.38 seconds)"
number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)
# About 107,000 results
Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The main difference in your case is that you don't have to pick selectors to extract data or maintain the parser over time, since that's already done for the end user. The only thing left to do is get the data you want from the structured JSON.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "fus ro dah defenition",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
result = results["search_information"]['total_results']
print(result)
# 107000
P.S - I wrote a blog post about how to scrape Google Organic Results.
Disclaimer, I work for SerpApi.