
Hello and thank you kindly for your help,

I've been using Python and Newspaper3k to scrape websites, but I've noticed that some functions are ...well... not functional. In particular, I've only been able to scrape the article HTML of roughly 1 in 10 sites, often fewer. Here is my code:

from newspaper import Article

url = 'http://pageurl.com'
article = Article(url, keep_article_html=True, language='en')
article.download()
article.parse()
print(article.title + "\n" + article.article_html)

What happens is that the article title is scraped, in my experience, 100% of the time, but the article HTML is hardly ever successfully scraped; nothing is returned. I know that Newspaper3k is based on BeautifulSoup, so I don't expect that to work either, and I'm kind of stuck. Any ideas?

edit: most of the sites I try to scrape are in Spanish

EricTalodi
  • What website are you trying to scrape? It would help guide you how to scrape it. I have to say I've never heard of Newspaper3k. – AaronS Jul 16 '20 at 20:44
  • Thanks for the response. Mainly science blogs, research papers, and scientific theories. I should have added that the sites I'm scraping are mainly in Spanish (my native language). Here is an example of a page that I failed to scrape: http://www.wellness-spain.com/-/estres-produce-acidez-en-el-organismo-principal-causa-de-enfermedades#:~:text=Con%20respecto%20al%20factor%20emocional,produce%20acidez%20en%20el%20organismo. – EricTalodi Jul 16 '20 at 21:36

2 Answers


So I didn't find much of a problem scraping wellness-spain.com with BeautifulSoup. The website doesn't use much JavaScript. Heavy JavaScript can cause problems for HTML parsers like BeautifulSoup, so before scraping a website it's worth turning off JavaScript in your browser to see what output the server actually gives you.

You didn't specify what data you required from that website, so I took an educated guess.

Coding Example

import requests
from bs4 import BeautifulSoup

url = 'http://www.wellness-spain.com/-/estres-produce-acidez-en-el-organismo-principal-causa-de-enfermedades#:~:text=Con%20respecto%20al%20factor%20emocional,produce%20acidez%20en%20el%20organismo'

# Fetch the page and parse the decoded HTML
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')

# CSS selectors: tag.classname targets a tag with that class, > a direct child
title = soup.select_one('h1.header-title > span').get_text().strip()
sub_title = soup.select_one('div.journal-content-article > h2').get_text()
# The byline reads "autor: NAME", so split on the colon and keep the name
author = soup.select_one('div.author > p').get_text().split(':')[1].strip()

Explanation of Code

We use the requests get method to grab the HTTP response. BeautifulSoup requires the response body as a string, which is what .text gives you. You will often see html.content used instead, but that is the raw binary response, so don't use that here. 'html.parser' is just the parser BeautifulSoup uses to parse the HTML correctly.
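To see the .content vs .text distinction without hitting the network, here is a minimal sketch; the HTML string is made up for illustration, but the bytes/str relationship is exactly what requests gives you:

```python
# response.content holds the raw bytes of the body;
# response.text is those bytes decoded to a str using the detected encoding.
raw = '<h1>Estrés</h1>'.encode('utf-8')   # what response.content would hold
decoded = raw.decode('utf-8')             # what response.text would give you

print(type(raw).__name__)      # bytes
print(type(decoded).__name__)  # str
print(decoded)                 # <h1>Estrés</h1>
```

BeautifulSoup wants the decoded str, which is why the answer passes html.text.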

We then use CSS selectors to choose the data you want. For the variable title we use select_one, which returns only the first matching element; a CSS selector can match a whole list of HTML tags (select returns all of them). If you don't know about CSS selectors, here are some resources.

  1. Video
  2. Article

Essentially, in the title variable we specify the HTML tag; the . signifies a class name, so h1.header-title will grab the h1 tag with class header-title. The > selects a direct child, and in this case we want the span element that is the child element of the h1.
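Here is that selector on a small made-up HTML snippet (the tags and class names mirror the real page, but the snippet itself is written from scratch for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<h1 class="header-title"><span> Estrés y acidez </span></h1>
<div class="journal-content-article"><h2>Subtitle here</h2></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# h1.header-title > span: the span that is a direct child of h1.header-title
title = soup.select_one('h1.header-title > span').get_text().strip()
print(title)  # Estrés y acidez

# select() returns ALL matches as a list; select_one() returns just the first
print(len(soup.select('h1.header-title > span')))  # 1
```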

Also in the title variable, the get_text() method grabs the text from the HTML tag, and the string strip() method then strips the string of surrounding whitespace.

Similarly, for the sub_title variable we grab the div element with class name journal-content-article, get its direct child h2 tag, and grab its text.

For the author variable, we select the div with class name author and get its direct child p tag. We grab its text, but the underlying text is autor: NAME, so using the string split method we split that string into a list of two elements, autor and NAME. I then selected the second element of that list and, using the string strip method, stripped any whitespace from it.
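The split/strip chain on its own, with a made-up author string standing in for the page's byline:

```python
# The page renders the byline as "autor: NAME" ("autor" is Spanish for author)
raw = 'autor: María García '

# split(':') -> ['autor', ' María García ']; [1] takes the name, strip() trims it
author = raw.split(':')[1].strip()
print(author)  # María García
```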

If you're having problems scraping specific websites, it's best to make a new question and show us the code you've tried and what your specific data needs are; try to be as explicit as possible with this. The URL helps us direct you to getting your scraper working.

AaronS

You need to use the Config() class in order to extract the article HTML. Here is the full code to do it.

import lxml.html
from newspaper import Article, Config


def extract_article_html(url):
    config = Config()
    config.fetch_images = True
    config.request_timeout = 30
    config.keep_article_html = True
    article = Article(url, config=config)

    article.download()
    article.parse()

    article_html = article.article_html

    html = lxml.html.fromstring(article_html)
    for tag in html.xpath('//*[@class]'):
        tag.attrib.pop('class')

    return lxml.html.tostring(html).decode('utf-8')


url = 'https://www.stackoverflow.com'
print(url, extract_article_html(url))
Praveen Kumar