There's actually no need to save the HTML file, and one of the reasons why response output is different from the one you see in the browser is that there are no headers being sent with the request, in this case, user-agent
which will act as a "real" user visit (already written by Cucurucho).
When no user-agent
is specified (when using requests
library) it defaults to python-requests thus Google understands it, blocks a request and you receive a different HTML with different CSS
selectors. Check what's your user-agent
.
Pass user-agent
:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
To easier grab CSS
selectors, have a look at the SelectorGadget extension to get CSS
selectors by clicking on the desired element in your browser.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
params = {
'q': 'how to create minecraft server',
'gl': 'us',
'hl': 'en',
}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')
# [:5] - first 5 results
# container with needed data: title, link, snippet, etc.
for result in soup.select('.tF2Cxc')[:5]:
link = result.select_one('.yuRUbf a')['href']
print(link, sep='\n')
----------
'''
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
https://www.minecraft.net/en-us/download/server
https://www.idtech.com/blog/creating-minecraft-server
https://minecraft.fandom.com/wiki/Tutorials/Setting_up_a_server
https://codewizardshq.com/how-to-make-a-minecraft-server/
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to spend time thinking about how to bypass blocks from Google or what is the right CSS
selector to parse the data, instead, you need to pass parameters (params
) you want, and iterate over structured JSON and get the data you want.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "how to create minecraft server",
"hl": "en",
"gl": "us",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"][:5]:
print(result["link"], sep="\n")
----------
'''
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
https://www.minecraft.net/en-us/download/server
https://www.idtech.com/blog/creating-minecraft-server
https://minecraft.fandom.com/wiki/Tutorials/Setting_up_a_server
https://codewizardshq.com/how-to-make-a-minecraft-server/
'''
Disclaimer, I work for SerpApi.