I've tried to create a web scraper for CNN. My goal is to scrape all news articles returned by a search query. Sometimes I get output for some of the scraped pages, and sometimes it doesn't work at all.

I am using the selenium and BeautifulSoup packages in a Jupyter Notebook. I iterate over the pages via the URL parameters &page={}&from={}. Before that, I tried locating the next button at the end of the page with By.XPATH and clicking it, but that gave me the same results.

Here's the code I'm using:

#0 ------------import libraries
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

#3 ------------CNN SCRAPER
#3.1 ----------Define Function
def CNN_Scraper(max_pages):
    # Selenium 3-style call; newer Selenium releases take a Service object instead of a path
    browser = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
    base_url = 'https://edition.cnn.com/search?q=coronavirus&sort=newest&category=business,us,politics,world,opinion,health&size=100'

    #-------------Define empty lists to be scraped
    CNN_title = []
    CNN_date = []
    CNN_article = []
    CNN_link = []
    article_count = 0

    #-------------iterate over pages and extract
    try:
        for page in range(1, max_pages + 1):
            print("Page %d" % page)

            url = base_url + "&page=%d&from=%d" % (page, article_count)
            browser.get(url)
            # implicitly_wait() only affects find_element calls, so wait explicitly
            # until at least one result has been rendered before reading page_source
            WebDriverWait(browser, 30).until(
                EC.presence_of_element_located((By.CLASS_NAME, 'cnn-search__result-contents')))
            soup = BeautifulSoup(browser.page_source, 'lxml')
            search_results = soup.find('div', {'class': 'cnn-search__results-list'})
            contents = search_results.find_all('div', {'class': 'cnn-search__result-contents'})

            for content in contents:
                try:
                    title = content.find('h3').text
                    print(title)
                    link_url = content.find('a')['href']
                    date = content.find('div', {'class': 'cnn-search__result-publish-date'}).text.strip()
                    article = content.find('div', {'class': 'cnn-search__result-body'}).text
                except (AttributeError, TypeError, KeyError):
                    # A result is missing one of the expected elements; skip it
                    print("loser")
                    continue
                CNN_title.append(title)
                CNN_date.append(date)
                CNN_article.append(article)
                CNN_link.append(link_url)

            article_count += 100
            print("-----")
    finally:
        # Close the browser even if a page fails to load
        browser.quit()

    #-------------Save in DF
    df = pd.DataFrame()
    df['title'] = CNN_title
    df['date'] = CNN_date
    df['article'] = CNN_article
    df['link'] = CNN_link
    return df
    
#3.2 ----------Call Function - Scrape CNN and save pickled data
CNN_data = CNN_Scraper(2)
#CNN_data.to_pickle("CNN_data")
flw

1 Answer

Call the back-end API directly instead of rendering the search page in a browser. For more details, check my previous answer.

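import requests


def main(url):
    # Reuse one session so the connection stays open across requests
    with requests.Session() as req:
        # Step the `from` offset in increments of 100 to match size=100
        for item in range(1, 1000, 100):
            r = req.get(url.format(item))
            r.raise_for_status()  # stop on an HTTP error instead of parsing an error page
            for a in r.json()['result']:
                print("Headline: {}, Url: {}".format(
                    a['headline'], a['url']))
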
main("https://search.api.cnn.io/content?q=coronavirus&sort=newest&category=business,us,politics,world,opinion,health&size=100&from={}")
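Since the question wants the results in a DataFrame rather than printed, here is a minimal sketch of how the same API call could collect rows instead. It is only a sketch under a few assumptions: `collect_results`, `fetch_json` and `API_URL` are hypothetical names, `from` is treated as a zero-based offset (as in the question's own URL), only the `result`, `headline` and `url` keys shown above are relied on, and the loop stops at the first empty page. It uses `urllib` from the standard library to stay dependency-free; a `requests.Session` as above would work just as well.

```python
import json
from urllib.request import urlopen

# Same endpoint as above, with the page size and offset left as placeholders.
API_URL = ("https://search.api.cnn.io/content?q=coronavirus&sort=newest"
           "&category=business,us,politics,world,opinion,health"
           "&size={size}&from={offset}")


def fetch_json(url):
    """Download one page of search results and decode the JSON body."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def collect_results(max_pages, size=100, fetch=fetch_json):
    """Return a list of {'title', 'link'} dicts from up to max_pages pages.

    `fetch` is injectable so the paging logic can be tried without network access.
    """
    rows = []
    for page in range(max_pages):
        # Assumption: `from` is a zero-based offset into the result set.
        payload = fetch(API_URL.format(size=size, offset=page * size))
        results = payload.get("result", [])
        if not results:
            break  # no more results, stop paging early
        for item in results:
            rows.append({"title": item.get("headline"),
                         "link": item.get("url")})
    return rows
```

The returned list can be passed straight to pandas, for example `pd.DataFrame(collect_results(2))`, which gives `title` and `link` columns.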
  • That works beautifully. I'm trying to do the same with a couple of other newspapers. I tried following the explanation in your previous answer to locate the XHR request on CNBC. The website I want to scrape is https://www.cnbc.com/search/?query=coronavirus&qsearchterm=coronavirus but I only found this (which doesn't work): https://api.sail-personalize.com/v1/personalize/initialize?pageviews=1&isMobile=0&query=coronavirus&qsearchterm=coronavirus Could you help me out? If you feel this is off-topic, I can post it as a new question. – flw Apr 11 '20 at 08:56
  • @FlaviaWagner Well, since it's **off-topic** for the current question, feel free to open a new question. – αԋɱҽԃ αмєяιcαη Apr 11 '20 at 08:58
  • As requested: https://stackoverflow.com/questions/61154530/calling-back-end-api-of-cnbc-in-python – flw Apr 11 '20 at 09:19