
I am trying to scrape Google News headings, along with their links, for an input search term. But when I used the find_all method to search for a class that contains all the news headings, it returned an empty list.

I also tried the parent divs by their ids, but the result was no different.

import requests
from bs4 import BeautifulSoup

input_term = input("Enter a term to search:")
source = requests.get("https://www.google.com/search?q={0}&source=lnms&tbm=nws".format(input_term)).text
soup = BeautifulSoup(source, 'html.parser')

# 'bkWMgd' is the class that I found to contain all the search results.
heading_results = soup.find_all('div', class_='bkWMgd')
print(heading_results)

I want to scrape all news headings and their respective links. I expected a list of all search results from the above code, but it returns an empty list.

Yash Tile
  • 51
  • 7
  • 2
    The DOM is generated dynamically by JavaScript on this page. You're going to need Selenium or some other driver to extract the content you want. – ggorlen May 13 '19 at 05:18
  • 3
    Open a browser, turn off JavaScript and load your URL, and you will see what requests/BeautifulSoup can see. Normally Google uses JavaScript to display results, but it can also display the page without JavaScript, in which case it may use different tags, classes, etc. – furas May 13 '19 at 05:25

2 Answers


The response that BeautifulSoup sees can differ considerably from the one in your browser because of JavaScript, so the selectors you need may differ too. It's always a good idea to print the response you actually receive from requests, analyze the HTML, and then choose selectors by class/id accordingly.

import requests
from bs4 import BeautifulSoup

input_term = input("Enter a term to search:")
source = requests.get(
    "https://www.google.com/search?q={0}&source=lnms&tbm=nws".format(input_term)).text
soup = BeautifulSoup(source, 'html.parser')

# here div#ires contains an ol which contains the results.
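heading_results = soup.find("div", {"id": "ires"}).find("ol").find_all('h3', {'class': 'r'})
# Loop over each item to obtain the title and link (anchor tag text and link)
for heading in heading_results:
    anchor = heading.find('a')
    if anchor is not None:
        print(anchor.get_text(), anchor.get('href'))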
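
To decide which selectors to use, it can help to list the classes that actually occur in the HTML you received. The following is only a rough sketch of that idea: it uses the standard library's HTMLParser so it runs without extra installs, ClassCounter and count_classes are hypothetical names, and sample_html is a made-up stand-in for the real response text (in practice you would pass the text returned by requests).

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count (tag, class) pairs seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the opening tag
        for name, value in attrs:
            if name == "class" and value:
                for cls in value.split():
                    self.counts[(tag, cls)] += 1


def count_classes(html):
    """Return a Counter of (tag, class) pairs found in html."""
    parser = ClassCounter()
    parser.feed(html)
    return parser.counts


# Made-up sample; replace with the response text you actually received.
sample_html = (
    '<div id="ires"><ol>'
    '<h3 class="r"><a href="/a">A</a></h3>'
    '<h3 class="r"><a href="/b">B</a></h3>'
    '</ol></div>'
)

for (tag, cls), n in count_classes(sample_html).most_common():
    print(tag, cls, n)
```

If the class you copied from the browser's developer tools does not show up in this listing, the selector will return nothing no matter how it is written.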


hem

Here's code that I tested on several search results. To make it work with a different search, just change the URL passed to requests.get in the response variable.

A shorter URL (example: https://www.google.com/search?hl=en-US&q=best+cookies&tbm=nws) can also be used.

Code and full example:

from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get(
    'https://www.google.com/search?hl=en-US&q=best+coockie&tbm=nws&sxsrf=ALeKk009n7GZbzUhUpsMTt89rigSAluBsQ%3A1616683043826&ei=I6BcYP_OMeGlrgTAwLpA&oq=best+coockie&gs_l=psy-ab.3...325216.326993.0.327292.12.12.0.0.0.0.163.1250.2j9.11.0....0...1c.1.64.psy-ab..1.0.0....0.305S8ngx0uo',
    headers=headers)

html = response.text
soup = BeautifulSoup(html, 'lxml')

for heading in soup.find_all('div', class_='dbsr'):
    title = heading.find('div', class_='JheGif nDgy9d').text
    link = heading.a['href']
    print(title)
    print(link)
    print()

Output:

The BEST cookie on the planet (and the Village too!)
https://thecoastnews.com/the-best-cookie-on-the-planet-and-the-village-too/

Best baking kits for kids 2021: Cookie mixes to flapjack recipes
https://www.independent.co.uk/extras/indybest/food-drink/baking/best-kids-baking-kits-b1821245.html

The official Girl Scout cookie power rankings
https://www.latimes.com/food/story/2021-02-24/girl-scout-cookie-power-rankings

Girl Scout Cookie Taste Test: Little Brownie Bakers vs. ABC
https://www.thedailymeal.com/eat/girl-scout-cookie-taste-comparison-abc-little-brownie-bakers

Food Critic, Provocateur Definitively Ranks Girl Scout Cookies
https://www.npr.org/2021/03/07/974226510/food-critic-provocateur-definitively-ranks-girl-scout-cookies

Chef Magnus Nilsson Jam Shortbread Cookie Recipe From ...
https://www.bloomberg.com/news/articles/2021-02-26/chef-magnus-nilsson-jam-shortbread-cookie-recipe-from-faviken-breakfast

Top 10 Best Cookie Cutters 2021 – Bestgamingpro
https://bestgamingpro.com/cookie-cutters/

Learn to make a favorite Girl Scout cookie at home
https://www.latimes.com/food/story/2021-02-25/learn-to-make-the-best-girl-scout-cookie-at-home

The 5 Best Cookie Jars
https://www.elitedaily.com/p/the-5-best-cookie-jars-63505798

Ulker Biskuvi Turkey's Best Cookie Picked as Top Stock for 2021
https://www.bloomberg.com/news/articles/2021-02-25/cookie-maker-tops-turkey-s-best-stock-bets-amid-hunt-for-value

Also, to get the .text and URLs, you need to specify which element (a div or whatever) you want to scrape them from.

In your code, you selected only one div class with find_all(). If you then try to read .text on the result, you will get an error: AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In that case, you can use a for loop and grab what you want from each element inside it.
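
As a small illustration of that point, here is a sketch on a made-up HTML snippet (the "title" class and the example.com links are invented for the demo and are not Google's real markup). It loops over the ResultSet instead of reading .text on it directly, and skips any element where the expected tags are missing.

```python
from bs4 import BeautifulSoup

# Invented sample markup, only loosely shaped like a results page.
sample_html = """
<div class="dbsr"><a href="https://example.com/one"><div class="title">First headline</div></a></div>
<div class="dbsr"><a href="https://example.com/two"><div class="title">Second headline</div></a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
results = soup.find_all("div", class_="dbsr")

# results is a ResultSet (a list of tags), so results.text would raise
# AttributeError. Read the title and link from each element instead.
pairs = []
for result in results:
    title_tag = result.find("div", class_="title")
    link_tag = result.find("a")
    if title_tag is None or link_tag is None:
        continue  # skip results that do not have the expected structure
    pairs.append((title_tag.get_text(strip=True), link_tag.get("href")))

for title, link in pairs:
    print(title)
    print(link)
```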

Sometimes find_all()/findAll() returns an empty list because you didn't specify a user-agent. The default requests user-agent is not a browser's, so Google may respond with different HTML (it could be a layout meant for another device) that uses different classes and selectors. Because of this, when you search for class_='bkWMgd', that class may not be present in the response you actually received. Hope this makes sense.
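
To see which user-agent would be sent without actually contacting Google, one option is to prepare the requests and inspect their headers. This is just a sketch: nothing is sent over the network, and the browser_headers value is an example string rather than a required one.

```python
import requests

url = "https://www.google.com/search?hl=en-US&q=best+cookies&tbm=nws"

# Example browser-like user-agent; any current browser string could be used.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/70.0.3538.102 Safari/537.36"
}

session = requests.Session()

# prepare_request() builds the request (including default headers)
# but does not send it.
default_request = session.prepare_request(requests.Request("GET", url))
custom_request = session.prepare_request(
    requests.Request("GET", url, headers=browser_headers))

print(default_request.headers["User-Agent"])  # a python-requests/... string
print(custom_request.headers["User-Agent"])   # the browser-like string
```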

I skipped the input element since it complicates stuff :)
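
If you do want to keep the search term as user input, one possible approach is to URL-encode it before building the URL, so that spaces and special characters don't break the query. A minimal sketch, assuming the same hl, q and tbm parameters as the shorter URL above; build_news_url is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_news_url(term, lang="en-US"):
    """Return a Google News search URL for term (hypothetical helper)."""
    # urlencode escapes spaces as '+' and percent-encodes special characters
    query = urlencode({"hl": lang, "q": term, "tbm": "nws"})
    return "https://www.google.com/search?" + query


print(build_news_url("best cookies"))
```

The result can then be passed to requests.get together with the headers, with the term coming from input() instead of a hard-coded string.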


Alternatively, you can also use SerpApi News Result API to get these (and more) results.

SerpApi Example JSON News Results:

"news_results": [
    {
      "position": 1,
      "title": "Trump brushes aside environmental concerns, signs 2 executive ...",
      "link": "https://www.usatoday.com/story/news/nation/2019/04/10/president-trump-orders-speed-oil-gas-pipeline-projects/3431466002/",
      "source": "USA TODAY",
      "date": "6 hours ago",
      "snippet": "Aiming to streamline oil and gas pipeline projects, President Donald Trump on Wednesday signed two executive orders making it harder for ...",
      "category": "In-Depth",
      "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRQdBI3wIjf_BX3zfRRYJjTGRRF5CNNZvqWAuza8-4mVZ75iBjlwOVTxcfGtg6_hLyUbPQ9cFA"
    }
]

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "best cookies",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}")
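
A slightly more defensive variant of that loop might look like the sketch below. It assumes, which I have not verified for every query, that the "news_results" key or individual fields can be absent from the returned dictionary; format_news is a hypothetical helper, and sample_results is a trimmed copy of the JSON example above rather than live API output.

```python
def format_news(results):
    """Return 'Title/Link' strings for each news result (hypothetical helper)."""
    lines = []
    # .get() avoids a KeyError if the key is missing (assumed possible)
    for item in results.get("news_results", []):
        title = item.get("title", "(no title)")
        link = item.get("link", "(no link)")
        lines.append(f"Title: {title}\nLink: {link}")
    return lines


# Trimmed from the JSON example above; not a live response.
sample_results = {
    "news_results": [
        {
            "position": 1,
            "title": "Trump brushes aside environmental concerns, signs 2 executive ...",
            "link": "https://www.usatoday.com/story/news/nation/2019/04/10/president-trump-orders-speed-oil-gas-pipeline-projects/3431466002/",
        }
    ]
}

for line in format_news(sample_results):
    print(line)
```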

Disclaimer: I work for SerpApi.

Dmitriy Zub