
I want to scrape headlines and paragraph texts from the Google News search results page for a given search term, and I want to do this for the first n pages.

I have written a piece of code that scrapes the first page only, but I do not know how to modify my URL so that I can move on to the other pages (page 2, 3, ...). That is the first problem I have.

The second problem is that I cannot scrape the headlines: every approach I have tried returns an empty list. (I do not think the page is dynamic.)

On the other hand, scraping the paragraph text below each headline works perfectly. Can you tell me how to fix these two problems?

This is my code:

from bs4 import BeautifulSoup
import requests

term = 'cocacola'

# this is only for page 1, how to go to page 2?
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# I do not think this is JavaScript-sensitive; the page is not dynamic
headline_results = soup.find_all('a', class_="l lLrAF")
#headline_results = soup.find_all('h3', class_="r dO0Ag") # also does not work
print(headline_results) # empty list, I don't know why

paragraph_results = soup.find_all('div', class_='st')
print(paragraph_results) # works

2 Answers


Problem One: Paging through the results.

In order to move to the next page you need to include the start parameter in your formatted URL string:

term = 'cocacola'
page = 2
url = 'https://www.google.com/search?q={}&source=lnms&tbm=nws&start={}'.format(
    term, (page - 1) * 10
)
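Since each page shows 10 results, `start` advances in steps of 10. A minimal sketch (no network calls, just URL construction) that builds the URLs for the first n pages:

```python
# Build the Google News search URLs for the first n pages.
# Assumption: results are paginated 10 per page via the `start` parameter.
term = 'cocacola'
n_pages = 3

urls = [
    'https://www.google.com/search?q={}&source=lnms&tbm=nws&start={}'.format(
        term, page * 10
    )
    for page in range(n_pages)
]

for u in urls:
    print(u)
```

Page 1 corresponds to `start=0`, page 2 to `start=10`, and so on.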

Problem Two: Scraping the headlines.

Google regenerates the class names, ids, etc. of its DOM elements, so any approach that hardcodes them is likely to fail every time you retrieve new, uncached markup.
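One way around this is to select by structure rather than by generated class names. A minimal sketch, assuming headlines are rendered as an `h3` inside a result link (this is an assumption about the current markup and may itself change; the HTML below is a canned stand-in for a real response):

```python
from bs4 import BeautifulSoup

# Canned fragment standing in for a Google News result; real markup differs.
html = """
<div>
  <a href="https://example.com/story"><h3>Example headline</h3></a>
  <div class="st">Example paragraph text.</div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Select headlines by structure (h3 inside a link), not by class name
headlines = [h3.get_text() for h3 in soup.select('a h3')]
print(headlines)  # ['Example headline']
```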


Just add the parameter start=10 to the search URL, like: https://www.google.com/search?q=beatifulsoup&ie=utf-8&oe=utf-8&aq=t&start=10

For dynamic behavior, i.e. looping over the result pages, use something like this:

from bs4 import BeautifulSoup
from requests import get

term="beautifulsoup"
page_max = 5

# loop over pages
for page in range(0, page_max):
    url = "https://www.google.com/search?q={}&ie=utf-8&oe=utf-8&aq=t&start={}".format(term, 10*page)

    r = get(url) # you can also add headers here
    html_soup = BeautifulSoup(r.text, 'html.parser')
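Inside that loop you can reuse the paragraph extraction from the question and accumulate results across pages. A sketch (network calls omitted; canned HTML fragments stand in for each page's response, and the `st` class is Google's own and may change):

```python
from bs4 import BeautifulSoup

# Canned fragments standing in for each page's HTML response
fake_pages = [
    '<div class="st">Snippet A</div>',
    '<div class="st">Snippet B</div>',
]

all_paragraphs = []
for html in fake_pages:
    soup = BeautifulSoup(html, 'html.parser')
    # Collect every snippet on this page into the combined list
    all_paragraphs.extend(d.get_text() for d in soup.find_all('div', class_='st'))

print(all_paragraphs)  # ['Snippet A', 'Snippet B']
```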
Radek Zika
But how to make it dynamic, with a `term` variable and a for loop over pages `range(1,5)`? – taga Nov 17 '19 at 11:55