2

I'm trying to parse the title and text of each result from a Google News search for "test".

The search URL is: https://www.google.com/search?biw=2513&tbm=nws&sxsrf=ALeKk02tev7vVkPiKz3E20Lih1-7Ol8SBw%3A1612526096099&ei=EDIdYNXbBdmc1fAPid678A0&q=test&oq=test&gs_l=psy-ab.3..0l10.25658.26016.0.26105.4.4.0.0.0.0.74.204.3.3.0....0...1c.1.64.psy-ab..1.3.202....0.y_53L-Gyyyw

Each result element contains a g-card tag.

When I attempt to parse using:

from bs4 import BeautifulSoup
import requests

url="https://www.google.com/search?q=bitcoin&sxsrf=ALeKk00r2AqKlBSgzF1T_zG1uQBaBKSN1g:1612525788197&source=lnms&tbm=nws&sa=X&ved=2ahUKEwji6q7W1tLuAhW0ShUIHSGmBpoQ_AUoAXoECBcQAw&biw=2513&bih=1315"
code=requests.get(url)
soup=BeautifulSoup(code.text,"html.parser")
soup.find_all("g-card")

The result is an empty list:

[]

How should I change the find_all call so that it returns the news results, letting me select the title and text of each one?

blue-sky

3 Answers

1

The page you are trying to parse is dynamic: its JavaScript has to run in a browser before the HTML you see is rendered.

Fetching the page with requests therefore returns only the raw page source, before any JavaScript has run.

To parse a dynamic page, use something like Selenium to run the JavaScript in a browser, then take the rendered HTML and parse it with BeautifulSoup.
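
A minimal sketch of that approach, assuming Selenium 4 with a local Chrome install. The helper names (build_news_url, fetch_rendered_html, card_texts) are illustrative and not part of any library, and the assumption that q and tbm=nws are enough to reach the News tab is mine. Google may also return a consent or CAPTCHA page instead of results, which this sketch does not handle.

```python
from urllib.parse import urlencode


def build_news_url(query):
    """Build a Google News search URL from a search term."""
    # q is the search term; tbm=nws selects the News tab (assumed sufficient).
    return "https://www.google.com/search?" + urlencode({"q": query, "tbm": "nws"})


def fetch_rendered_html(url):
    """Load the page in headless Chrome and return the HTML after JavaScript has run."""
    # Imported lazily so the URL helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


def card_texts(html):
    """Return the visible text of every g-card element found in the HTML."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [card.get_text(" ", strip=True) for card in soup.find_all("g-card")]
```

Calling card_texts(fetch_rendered_html(build_news_url("test"))) would then give one string per news card, provided the rendered page still uses g-card elements.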

Volpe95
0

This does the trick:

soup.text

which contains the text of the results.

To extract the URLs:

for link in soup.find_all('a', href=True):
    print(link['href'])

Complete code:

from bs4 import BeautifulSoup
import requests

url_search = 'https://www.google.com/search?biw=2513&bih=817&tbm=nws&sxsrf=ALeKk03-PpUbGxYQpIcp6OcJULFASqa_tA%3A1612525818528&ei=-jAdYKb1H9yf1fAPxrac8AU&q=test&oq=test&gs_l=psy-ab.3..0l10.1628056.1628435.0.1628556.4.4.0.0.0.0.112.340.3j1.4.0....0...1c.1.64.psy-ab..0.4.338....0.H4wnL6N3kBo'
code=requests.get(url_search)
soup=BeautifulSoup(code.text,"html.parser")
print(soup.text)

for link in soup.find_all('a', href=True):
    print(link['href'])
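
The loop prints every link on the page, including navigation links. As a rough sketch, and assuming the HTML returned to requests wraps outbound results as /url?q=<target>&... (often the case for this kind of response, but not guaranteed), the article URLs can be separated from the rest with the standard library. The helper name unwrap_google_link and the sample hrefs are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def unwrap_google_link(href):
    """Return the target of a '/url?q=...' redirect link, or None for other links."""
    parsed = urlparse(href)
    if parsed.path != "/url":
        return None
    targets = parse_qs(parsed.query).get("q")
    return targets[0] if targets else None


# Hypothetical hrefs of the kind the loop above prints.
hrefs = [
    "/search?q=test&tbm=nws&start=10",
    "/url?q=https://example.com/news/story&sa=U&ved=abc",
]

# Keep only the links that unwrap to an outbound target.
article_links = [url for url in map(unwrap_google_link, hrefs) if url]
print(article_links)
```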
blue-sky
0

I answered a similar question here.

Code (I added two lines to also extract the article summary):

from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get(
    'https://www.google.com/search?hl=en-US&q=best+coockie&tbm=nws&sxsrf=ALeKk009n7GZbzUhUpsMTt89rigSAluBsQ%3A1616683043826&ei=I6BcYP_OMeGlrgTAwLpA&oq=best+coockie&gs_l=psy-ab.3...325216.326993.0.327292.12.12.0.0.0.0.163.1250.2j9.11.0....0...1c.1.64.psy-ab..1.0.0....0.305S8ngx0uo',
    headers=headers)

html = response.text
soup = BeautifulSoup(html, 'lxml')

# Class names are Google-generated and reflect the markup when this was written.
for result in soup.find_all('div', class_='dbsr'):
    title = result.find('div', class_='JheGif nDgy9d').text
    summary = result.find('div', class_='Y3v8qd').text
    link = result.a['href']
    print(title)
    print(summary)
    print(link)
    print()

Alternatively, you can use the Google News Result API from SerpApi.

Part of JSON:

"news_results": [
  {
    "position": 1,
    "link": "https://abc7chicago.com/eisenhower-expressway-crash-wrong-way-chicago-traffic/10456033/",
    "title": "Eisenhower Expressway crashes: 5 killed in separate I-290 wrong-way crashes in Chicago, Forest Park",
    "source": "WLS-TV",
    "date": "16 hours ago",
    "thumbnail": "https://serpapi.com/searches/606340870574f50571da7bfd/images/2f5ade266f837059c67526895fb3916f7518aefbb5215951bb79d83871345dedc741519fefe9c85a8abb834360552c65898af6461c5709de.jpeg"
  }
]

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "chicago",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
    print(f"Article summary: {news_result['snippet']}\n")

Output:

Article summary: A Chicago-based marijuana cultivator and dispenser that has rapidly grown 
into one of the nation's biggest pot firms is under federal ...

Article summary: With 2021 being a pivotal season for the Chicago Cubs and the direction of 
the franchise, here are three bold predictions you may see play out ...

Article summary: The Chicago Blackhawks have lacked puck management. With a team with high 
offensive upside in the Carolina Hurricanes, this cannot ...

Article summary: Chicago, IL - Lírica, Chicagos New Latin-American Inspired Restaurant and 
Bar,

Article summary: A father of three is lucky to be alive after what he describes as a failed 
carjacking that left him running for his life, and his car riddled with 
bullets, ...

Article summary: Robservations on the media beat: VSiN, the Las Vegas-based sports 
information network founded by a group of Chicago entrepreneurs in ...

Article summary: In the day's first reported shooting a man was shot about 2 a.m. in the 
2700 block of South Karlov Avenue.

Article summary: Cameo, the Chicago-based startup that lets users buy video shout-outs from 
celebrities, has banked $100 million in Series C funding -- which ...

Article summary: CHICAGO (CBS) — Although much of the contemporary discussion of COVID-19 
center around rolling out the vaccines, there are still people ...
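
The JSON excerpt above shows no snippet field, so it may be safer not to assume every result carries one. A small sketch using dict.get instead of indexing follows. The summarize helper is hypothetical, and the sample data simply copies the excerpt's fields.

```python
def summarize(news_results):
    """Yield (title, snippet) pairs, using an empty string for any missing key."""
    for item in news_results:
        yield item.get("title", ""), item.get("snippet", "")


# Sample shaped like the JSON excerpt above, which has no "snippet" key.
sample = {
    "news_results": [
        {
            "position": 1,
            "link": "https://abc7chicago.com/eisenhower-expressway-crash-wrong-way-chicago-traffic/10456033/",
            "title": "Eisenhower Expressway crashes: 5 killed in separate I-290 wrong-way crashes in Chicago, Forest Park",
            "source": "WLS-TV",
            "date": "16 hours ago",
        }
    ]
}

# .get on the outer dict also covers responses without any news results.
for title, snippet in summarize(sample.get("news_results", [])):
    print(f"Title: {title}")
    print(f"Article summary: {snippet}\n")
```

With indexing, the same sample would raise a KeyError on the missing snippet key.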

Disclaimer: I work for SerpApi.

Dmitriy Zub