I want to scrape a few articles from the El Pais website archive. From each article I take the title, the hashtags, and the article body. The HTML structure of each article is the same, and the script succeeds with all the titles and hashtags; for some articles, however, it does not scrape the body at all. Below are my code, links to articles that work fully, and a few links to articles that return an empty body. Do you know how to fix it? The empty bodies follow no regular pattern: there can be 3 empty articles in a row, then 5 successful ones, then 1 empty, then 3 successful.
Working articles:
article1 https://elpais.com/diario/1990/01/17/economia/632530813_850215.html
article2 https://elpais.com/diario/1990/01/07/internacional/631666806_850215.html
article3 https://elpais.com/diario/1990/01/05/deportes/631494011_850215.html
Articles without the body:
article4 https://elpais.com/diario/1990/01/23/madrid/633097458_850215.html
article5 https://elpais.com/diario/1990/01/30/economia/633654016_850215.html
article6 https://elpais.com/diario/1990/01/03/espana/631321213_850215.html
from bs4 import BeautifulSoup
import requests
# URL of the article to be scraped (article1 here; swap in any URL listed above)
URL = "https://elpais.com/diario/1990/01/17/economia/632530813_850215.html"
# print(URL)
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
bodydiv = soup.find("div", id="ctn_article_body")
artbody = bodydiv.find_all("p", class_="")
tagdiv = soup.find("div", id="mod_archivado")
hashtags = tagdiv.find_all("li", class_="w_i | capitalize flex align_items_center")
titlediv = soup.find("div", id="article_header")
title = titlediv.find("h1")
# print the title of the article
print(title.text)
# print the body of the article
arttext = ""
for par in artbody:
    arttext += str(par.text)
print(arttext)
# print the hashtags, comma-separated
tagstring = ""
for hashtag in hashtags:
    tagstring += hashtag.text
    tagstring += ","
print(tagstring)
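To narrow down where the body goes missing, a small diagnostic like the one below might help. It is only a sketch: `diagnose_body` and `SAMPLE` are hypothetical names that are not part of my script, and the sample HTML is made up for illustration rather than copied from El Pais. It reports whether the `ctn_article_body` container was found, how many `<p>` tags sit inside it, and how many of those have no class.

```python
from bs4 import BeautifulSoup


def diagnose_body(html, parser="html.parser"):
    """Report where body extraction comes back empty for one page.

    Hypothetical helper: it reuses the id from the script above
    (ctn_article_body) but is otherwise an illustration only.
    """
    soup = BeautifulSoup(html, parser)
    report = {
        "container_found": False,
        "p_in_container": 0,
        "p_without_class": 0,
        "p_in_page": len(soup.find_all("p")),
        "body": "",
    }
    container = soup.find("div", id="ctn_article_body")
    if container is None:
        # The container itself is missing from the parsed tree.
        return report
    paragraphs = container.find_all("p")
    # Keep only paragraphs that carry no class attribute.
    plain = [p for p in paragraphs if not p.get("class")]
    report["container_found"] = True
    report["p_in_container"] = len(paragraphs)
    report["p_without_class"] = len(plain)
    report["body"] = " ".join(p.get_text(strip=True) for p in plain)
    return report


# Made-up sample markup, not taken from the real site.
SAMPLE = """
<div id="article_header"><h1>Title</h1></div>
<div id="ctn_article_body">
  <p>First.</p><p class="nota">Note.</p><p>Second.</p>
</div>
"""

print(diagnose_body(SAMPLE))
```

On a real page this could be called as `diagnose_body(page.content)`. If `p_in_page` is large while `p_in_container` is 0, the paragraphs exist but ended up outside the container in the parsed tree. In that case it may be worth comparing the report with another parser, for example `parser="lxml"` or `parser="html5lib"`, assuming those packages are installed.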
Thank you in advance for your help!
Un hombre de 59 años`. Try to reproduce these steps and tell me about it.
– Gealber Jul 02 '21 at 19:46