
I want to scrape headlines and paragraph texts from the Google News search results page for a given search term, and I want to do this for the first n pages.

I have written a piece of code that scrapes the first page only, but I do not know how to modify my URL so that I can move on to the other pages (page 2, 3, ...). That is the first problem I have.

The second problem is that I cannot scrape the headlines: every approach I have tried returns an empty list. (I do not think the page is dynamic.)

On the other hand, scraping the paragraph text below each headline works perfectly. Can you tell me how to fix these two problems?

This is my code:

from bs4 import BeautifulSoup
import requests

term = 'cocacola'

# this is only for page 1, how to go to page 2?
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# I do not think this is JavaScript-sensitive; the page is not dynamic
headline_results = soup.find_all('a', class_="l lLrAF")
#headline_results = soup.find_all('h3', class_="r dO0Ag") # also does not work
print(headline_results) # empty list, I don't know why

paragraph_results = soup.find_all('div', class_='st')
print(paragraph_results) # works

2 Answers


Problem One: Paging through the results.

In order to move to the next page you need to include the start parameter in your formatted URL string:

term = 'cocacola'
page = 2
url = 'https://www.google.com/search?q={}&source=lnms&tbm=nws&start={}'.format(
    term, (page - 1) * 10
)
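Since each page shows 10 results, `start` advances in steps of 10. A minimal sketch (no network calls, just URL construction) that builds the URLs for the first n pages:

```python
# Build the Google News search URLs for the first n pages.
# Assumption: results are paginated 10 per page via the `start` parameter.
term = 'cocacola'
n_pages = 3

urls = [
    'https://www.google.com/search?q={}&source=lnms&tbm=nws&start={}'.format(
        term, page * 10
    )
    for page in range(n_pages)
]

for u in urls:
    print(u)
```

Page 1 corresponds to `start=0`, page 2 to `start=10`, and so on.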

Problem Two: Scraping the headlines.

Google regenerates the class names, ids, etc. of its DOM elements, so any approach that hardcodes them is likely to fail every time you retrieve new, uncached markup.
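One way around this is to select by structure rather than by generated class names. A minimal sketch, assuming headlines are rendered as an `h3` inside a result link (this is an assumption about the current markup and may itself change; the HTML below is a canned stand-in for a real response):

```python
from bs4 import BeautifulSoup

# Canned fragment standing in for a Google News result; real markup differs.
html = """
<div>
  <a href="https://example.com/story"><h3>Example headline</h3></a>
  <div class="st">Example paragraph text.</div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Select headlines by structure (h3 inside a link), not by class name
headlines = [h3.get_text() for h3 in soup.select('a h3')]
print(headlines)  # ['Example headline']
```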


Just add the parameter start=10 to the search URL, like: https://www.google.com/search?q=beatifulsoup&ie=utf-8&oe=utf-8&aq=t&start=10

For dynamic behavior, i.e. looping over the result pages, use something like this:

from bs4 import BeautifulSoup
from requests import get

term="beautifulsoup"
page_max = 5

# loop over pages
for page in range(0, page_max):
    url = "https://www.google.com/search?q={}&ie=utf-8&oe=utf-8&aq=t&start={}".format(term, 10*page)

    r = get(url) # you can also add headers here
    html_soup = BeautifulSoup(r.text, 'html.parser')
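Inside that loop you can reuse the paragraph extraction from the question and accumulate results across pages. A sketch (network calls omitted; canned HTML fragments stand in for each page's response, and the `st` class is Google's own and may change):

```python
from bs4 import BeautifulSoup

# Canned fragments standing in for each page's HTML response
fake_pages = [
    '<div class="st">Snippet A</div>',
    '<div class="st">Snippet B</div>',
]

all_paragraphs = []
for html in fake_pages:
    soup = BeautifulSoup(html, 'html.parser')
    # Collect every snippet on this page into the combined list
    all_paragraphs.extend(d.get_text() for d in soup.find_all('div', class_='st'))

print(all_paragraphs)  # ['Snippet A', 'Snippet B']
```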
Radek Zika
But how to make it dynamic, with a `term` variable and a for loop over pages `range(1,5)`? – taga Nov 17 '19 at 11:55