Web Scraping news articles in some cases returns empty body

Question

I want to scrape a few articles from the El País website archive. From each article I take the title, hashtags, and article body. The HTML structure of each article is the same, and the script successfully extracts all the titles and hashtags; however, for some articles it does not scrape the body at all. Below I include my code, links to articles that work, and a few links to ones that return empty bodies. Do you know how to fix it? The empty bodies do not follow a regular pattern: there can be 3 empty articles in a row, then 5 successful ones, 1 empty, 3 successful.

Working articles:
article1 https://elpais.com/diario/1990/01/17/economia/632530813_850215.html
article2 https://elpais.com/diario/1990/01/07/internacional/631666806_850215.html
article3 https://elpais.com/diario/1990/01/05/deportes/631494011_850215.html

Articles without the body:
article4 https://elpais.com/diario/1990/01/23/madrid/633097458_850215.html
article5 https://elpais.com/diario/1990/01/30/economia/633654016_850215.html
article6 https://elpais.com/diario/1990/01/03/espana/631321213_850215.html

    from bs4 import BeautifulSoup
    import requests
    #place for the url of the article to be scraped
    URL = some_url_of_article_above
    #print(URL)
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, "html.parser")
    bodydiv = soup.find("div", id="ctn_article_body")
    artbody = bodydiv.find_all("p", class_="")
    tagdiv = soup.find("div", id="mod_archivado")
    hashtags= tagdiv.find_all("li", class_="w_i | capitalize flex align_items_center")
    titlediv = soup.find("div", id="article_header")
    title = titlediv.find("h1")
    #print title of the article
    print(title.text)
    #print body of the article
    arttext = ""
    for par in artbody:
        arttext += str(par.text)
    print(arttext)
    #hashtags
    tagstring = ""
    for hashtag in hashtags:
        tagstring += hashtag.text
        tagstring += ","
    print(tagstring)

Thank you in advance for your help!

Gealber · Accepted Answer · 2021-07-02T20:30:28.893

The problem is that inside the <div class="a_b article_body | color_gray_dark" id="ctn_article_body"> element there is a broken, unmatched </b> tag. Take a look at this snippet from the HTML page:

    <div id="ctn_article_body" class="a_b article_body | color_gray_dark"><p class=""></b>La Asociación Ecologista de Defensa dela Naturaleza (AEDENAT) ha denunciado que las obras de la carretera que cruza la Universidad Complutense desde la carretera de La Coruña hasta Cuatro Caminos se están realizando "sin permisos de ningún tipo" y suponen "la destrucción de zonas de pinar en las cercanías del edificio de Filosofia B".</p>

Just after the first <p> tag there is a closing </b> without its matching opening <b> tag. That is why "html.parser" fails here.

Using this text,

    from bs4 import BeautifulSoup

    text = """<div id="ctn_article_body" class="a_b article_body | color_gray_dark"><p class=""></b>La Asociación Ecologista de Defensa de la Naturaleza (AEDENAT) ha denunciado que las obras de la carretera que cruza la Universidad Complutense desde la carretera de La Coruña hasta Cuatro Caminos se están realizando "sin permisos de ningún tipo" y suponen "la destrucción de zonas de pinar en las cercanías del edificio de Filosofia B".</p><div id="elpais_gpt-INTEXT" style="width: 0px; height: 0px; display: none;"></div><p class="">Por su parte, José Luis Garro, tercer teniente de alcalde, ha declarado a EL PAÍS: "Tenemos una autorización provisional del rector de la Universidad Complutense. Toda esa zona, además, está pendiente de un plan especial de reforma interior (PERI). Ésta es sólo una solución provisional".</p><p class="">Según Garro, el trazado de la carretera "ha tenido que dar varias vueltas para no tocar las masas arbóreas", aunque reconoce que se ha hecho "en algunos casos", si bien causando "un daño mínimo".</p><p class="footnote">* Este artículo apareció en la edición impresa del lunes, 22 de enero de 1990.</p></div>"""

    soup = BeautifulSoup(text, "html.parser")
    print(soup.find("div"))

Output:

    <div class="a_b article_body | color_gray_dark" id="ctn_article_body"><p class=""></p></div>
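To check whether a given page contains this kind of unmatched closing tag, you can scan the raw HTML (what requests downloads, not what the browser inspector shows). The following is only a minimal sketch using the standard library; `StrayEndTagFinder` and `find_stray_end_tags` are hypothetical helper names, not part of BeautifulSoup, and the sample string is a shortened stand-in for the real article markup.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not tracked as "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StrayEndTagFinder(HTMLParser):
    """Hypothetical helper: records closing tags that have no open partner."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.stray = []      # (tag, (line, column)) for unmatched closing tags

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in self.open_tags:
            # Pop back to (and including) the most recent matching opener.
            while self.open_tags.pop() != tag:
                pass
        else:
            self.stray.append((tag, self.getpos()))


def find_stray_end_tags(html):
    """Return a list of (tag, (line, column)) for unmatched closing tags."""
    finder = StrayEndTagFinder()
    finder.feed(html)
    finder.close()
    return finder.stray


# Shortened stand-in for the article markup shown above.
sample = '<div id="ctn_article_body"><p class=""></b>Texto del artículo.</p></div>'
print(find_stray_end_tags(sample))
```

For the sample, the result contains a single entry for the `b` tag together with its position in the source. Running this over the pages that come back empty should show whether they share the same defect.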

How to solve this? I tried a different parser, "lxml" instead of "html.parser", and it works: the div is selected together with all of its paragraphs. So changing just this line should be enough:

    soup = BeautifulSoup(text, "lxml")

Of course, you will need to have this parser installed (for example, with pip install lxml).
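If you are not sure which parsers are installed, or want a fallback when one of them returns an empty body, one option is to try several parsers in turn and keep the first result that contains text. This is a minimal sketch, not a definitive implementation: `extract_body` and the parser order are assumptions of mine, and the `ctn_article_body` id is taken from the question. Which parsers recover the text from broken markup may depend on the installed versions.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def extract_body(html, parsers=("lxml", "html5lib", "html.parser")):
    """Return (parser_name, text) from the first parser that finds body text.

    Hypothetical helper: returns (None, "") if no parser yields any text.
    """
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
        body = soup.find("div", id="ctn_article_body")
        if body is None:
            continue
        # Collect the non-empty paragraphs inside the article body.
        paragraphs = [p.get_text(strip=True) for p in body.find_all("p")]
        text = "\n".join(t for t in paragraphs if t)
        if text:
            return parser, text
    return None, ""


# Shortened stand-in for the broken article markup.
sample = '<div id="ctn_article_body"><p class=""></b>Texto del artículo.</p></div>'
parser, text = extract_body(sample)
print(parser, repr(text))
```

Returning the parser name alongside the text makes it easy to log which parser succeeded for each article, which helps when the failures are irregular.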

EDIT:

As @moreni123 commented below, this solution seems to be correct for certain cases but not for all. Given that, I will add another option that could also work.

It seems better to use Selenium to fetch the webpage, given that some of the content is generated with JavaScript. requests cannot execute JavaScript; that is not its purpose.

I'm going to use Selenium with a headless Chrome driver:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from bs4 import BeautifulSoup

    # article to fetch
    url = "https://elpais.com/diario/1990/01/14/madrid/632319855_850215.html"

    driver_options = Options()
    driver_options.add_argument("--headless")
    # Selenium 4 takes the driver path through a Service object; older
    # versions used webdriver.Chrome(executable_path=..., options=...)
    driver = webdriver.Chrome(service=Service("path/to/chrome/driver"), options=driver_options)

    try:
        # this is the source code with the JS executed
        driver.get(url)
        page = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")

        # As before, we use BeautifulSoup to parse it. Selenium is a
        # powerful tool; you could use it for this step as well.
        soup = BeautifulSoup(page, "html.parser")
        print(soup.select("#ctn_article_body"))
    finally:
        # quit the driver even if something above fails
        driver.quit()

Make sure that the path to the Chrome driver is correct in this line:

    driver = webdriver.Chrome(service=Service("path/to/chrome/driver"), options=driver_options)

See the Selenium documentation and the ChromeDriver download page in case you need to install the driver.

This solution should work. At least for the article you sent me, it does.

Dear Gealber, thank you so much for your reply and help. I have already tested your solution on some articles and can confirm that it works. However, I inspected the HTML structure in both the web browser and Beautiful Soup and haven't found the `</b>` tag (I used exactly the same article as you did). After a few hundred articles I encountered the same empty-body problem, but this time neither lxml nor html.parser works. Here is a link to an example: http://elpais.com/diario/1990/01/14/madrid/632319855_850215.html. Do you have any idea or solution this time? Thank you in advance! — moreni123, Jul 02 '21 at 19:32
With that same article, let's take it step by step. First, the `</b>` will only be visible if you download the source code of the HTML. The source code is not the same as the code the browser inspector shows you, at least not always. To see the source code, you could download it with requests and store it in a file to analyze later, or in Chrome press Ctrl+U. In this last page you sent me, it is just before this text: ` color_gray_dark">Un hombre de 59 años`. Try to reproduce these steps and tell me how it goes. — Gealber, Jul 02 '21 at 19:46
In case this is confusing: the source code is not the same because the browser inspector shows you the code after the JavaScript has been executed, but that is not what you download with requests. With requests, what you download is the HTML without the JavaScript being executed. There is a way to get the source after the JavaScript has been executed, but it requires Selenium. I will update the answer with that option too. — Gealber, Jul 02 '21 at 19:50
