Here's code that I tested on several search results.
To make it work with a different search, just change the URL passed to requests.get in the response variable.
A shorter URL (for example, https://www.google.com/search?hl=en-US&q=best+cookies&tbm=nws) can also be used.
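If you'd rather not paste a URL by hand, one option is to let requests build the query string from a params dict. This is a minimal sketch, assuming the three parameters from the shorter URL are enough for your search; preparing the request only builds the URL and sends nothing:

```python
import requests

# Query parameters taken from the shorter URL; change "q" for other searches
params = {
    "hl": "en-US",
    "q": "best cookies",
    "tbm": "nws",
}

# Build (but do not send) the request so the final URL can be inspected
prepared = requests.Request(
    "GET", "https://www.google.com/search", params=params
).prepare()

print(prepared.url)
```

The same dict can be passed straight to requests.get("https://www.google.com/search", params=params, headers=headers).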
Code and full example:
from bs4 import BeautifulSoup
import requests

# Browser-like user-agent; without it Google may return different HTML
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get(
    'https://www.google.com/search?hl=en-US&q=best+coockie&tbm=nws&sxsrf=ALeKk009n7GZbzUhUpsMTt89rigSAluBsQ%3A1616683043826&ei=I6BcYP_OMeGlrgTAwLpA&oq=best+coockie&gs_l=psy-ab.3...325216.326993.0.327292.12.12.0.0.0.0.163.1250.2j9.11.0....0...1c.1.64.psy-ab..1.0.0....0.305S8ngx0uo',
    headers=headers)

html = response.text
soup = BeautifulSoup(html, 'lxml')

# Loop over every result container and pull the title and link from each one
for headings in soup.find_all('div', class_='dbsr'):
    title = headings.find('div', class_='JheGif nDgy9d').text
    link = headings.a['href']
    print(title)
    print(link)
    print()
Output:
The BEST cookie on the planet (and the Village too!)
https://thecoastnews.com/the-best-cookie-on-the-planet-and-the-village-too/
Best baking kits for kids 2021: Cookie mixes to flapjack recipes
https://www.independent.co.uk/extras/indybest/food-drink/baking/best-kids-baking-kits-b1821245.html
The official Girl Scout cookie power rankings
https://www.latimes.com/food/story/2021-02-24/girl-scout-cookie-power-rankings
Girl Scout Cookie Taste Test: Little Brownie Bakers vs. ABC
https://www.thedailymeal.com/eat/girl-scout-cookie-taste-comparison-abc-little-brownie-bakers
Food Critic, Provocateur Definitively Ranks Girl Scout Cookies
https://www.npr.org/2021/03/07/974226510/food-critic-provocateur-definitively-ranks-girl-scout-cookies
Chef Magnus Nilsson Jam Shortbread Cookie Recipe From ...
https://www.bloomberg.com/news/articles/2021-02-26/chef-magnus-nilsson-jam-shortbread-cookie-recipe-from-faviken-breakfast
Top 10 Best Cookie Cutters 2021 – Bestgamingpro
https://bestgamingpro.com/cookie-cutters/
Learn to make a favorite Girl Scout cookie at home
https://www.latimes.com/food/story/2021-02-25/learn-to-make-the-best-girl-scout-cookie-at-home
The 5 Best Cookie Jars
https://www.elitedaily.com/p/the-5-best-cookie-jars-63505798
Ulker Biskuvi Turkey's Best Cookie Picked as Top Stock for 2021
https://www.bloomberg.com/news/articles/2021-02-25/cookie-maker-tops-turkey-s-best-stock-bets-amid-hunt-for-value
Also, to get the .text and the URLs, you need to specify which element (a div or whatever) you want to scrape them from.
In your code, you specified only one div and one class. On top of that, if you try to return .text from the result of find_all(), it will give you an error: AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
In that case, you can use a for loop and grab what you want inside it.
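To make the error concrete, here is a small offline sketch. The HTML string and the "result" class are made up for illustration (they are not Google's real markup), and html.parser is used so no extra parser is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics two results; not Google's real markup
html = """
<div class="result"><a href="https://example.com/a">First title</a></div>
<div class="result"><a href="https://example.com/b">Second title</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
results = soup.find_all("div", class_="result")  # a list-like ResultSet

# Asking the whole ResultSet for .text raises AttributeError
try:
    results.text
    raised = False
except AttributeError:
    raised = True

# Looping yields one element at a time, where .text and ['href'] work
pairs = [(div.a.text, div.a["href"]) for div in results]

print(raised)
print(pairs)
```

If you only need the first match, find() returns a single element (or None), so .text can be called on it directly after a None check.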
Sometimes find_all()/findAll() will give you an empty list because you didn't specify a user-agent. The default user-agent is different (the response could be the one meant for a tablet, for example), so the HTML comes back with different classes and selectors. Because of this, when you search with class_="bkWMgd", the class in the HTML you actually received is different, since the request was made with a different user-agent. Hope this makes sense.
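You can check what requests sends by default without making any request. A short sketch, assuming you want to reuse the same browser-like string on every call through a session:

```python
import requests

# The user-agent requests sends when you pass no headers of your own
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A session that sends a browser-like user-agent on every request
# (string copied from the headers dict in the full example)
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
})
print(session.headers["User-Agent"])
```

With that in place, session.get(url) can be used instead of requests.get(url, headers=headers).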
I skipped the input element since it complicates stuff :)
Alternatively, you can also use SerpApi News Result API to get these (and more) results.
SerpApi Example JSON News Results:
"news_results": [
  {
    "position": 1,
    "title": "Trump brushes aside environmental concerns, signs 2 executive ...",
    "link": "https://www.usatoday.com/story/news/nation/2019/04/10/president-trump-orders-speed-oil-gas-pipeline-projects/3431466002/",
    "source": "USA TODAY",
    "date": "6 hours ago",
    "snippet": "Aiming to streamline oil and gas pipeline projects, President Donald Trump on Wednesday signed two executive orders making it harder for ...",
    "category": "In-Depth",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRQdBI3wIjf_BX3zfRRYJjTGRRF5CNNZvqWAuza8-4mVZ75iBjlwOVTxcfGtg6_hLyUbPQ9cFA"
  }
]
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "best cookies",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),  # API key read from the environment
}

search = GoogleSearch(params)
results = search.get_dict()

# Print the title and link of each news result
for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}")
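A response may not always contain a news_results key, in which case results["news_results"] raises KeyError. The sketch below uses a hypothetical helper, format_news, and a made-up sample dict shaped like the example JSON; it makes no API call:

```python
def format_news(results):
    """Return 'Title/Link' strings; an empty list if 'news_results' is missing."""
    lines = []
    for item in results.get("news_results", []):
        title = item.get("title", "")
        link = item.get("link", "")
        lines.append(f"Title: {title}\nLink: {link}")
    return lines


# Made-up sample shaped like the example JSON (not real API output)
sample = {
    "news_results": [
        {"position": 1, "title": "Example title", "link": "https://example.com/story"}
    ]
}

print(format_news(sample))
print(format_news({}))
```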
Disclaimer: I work for SerpApi.