0

I am trying to scrape text from webpages contained in tags of type titles, headings or paragraphs. When i try the below code I get mixed results depending on where the url is from. When i try some sources (e.g. Wikipedia or Reuters) the code works more or less fine and at least finds all the text. For other sources (e.g. Politico, The Economist) I start to miss a lot of the text contained in webpage.

I am using traversal algo to walk through the tree and check if the tag is 'of interest'. Maybe find_all(True, recursive=False) is for some reason missing children that subsequently contain the text I am looking for? I'm unsure how to investigate that. Or maybe some sites are blocking the scraping somehow? But then why can i scrape one paragraph from the economist?

Code below replicates issue for me - you should see the wikipedia page (urls[3]) print as desired, the politico (urls[0]) missing all text in the article and economist (urls[1]) missing all but one paragraph.

from bs4 import BeautifulSoup
import requests

urls = ["https://www.politico.com/news/2022/01/17/democrats-biden-clean-energy-527175", 
        "https://www.economist.com/finance-and-economics/the-race-to-power-the-defi-ecosystem-is-on/21807229",
        "https://www.reuters.com/world/significant-damage-reported-tongas-main-island-after-volcanic-eruption-2022-01-17/",
        "https://en.wikipedia.org/wiki/World_War_II"]

# get soup
url = urls[0] # first two urls don't work, last two do work
response = requests.get(url)
soup = BeautifulSoup(response.text, features="html.parser")

# tags with text that i want to print
tags_of_interest = ['p', 'title'] + ['h' + str(i) for i in range(1, 7)]

def read(soup):
    for tag in soup.find_all(True, recursive=False):

        if (tag.name in tags_of_interest): 
            print(tag.name  + ": ", tag.text.strip())

        for child in tag.find_all(True, recursive=False):
            read(child)

# call the function
read(soup)
Nimantha
  • 6,405
  • 6
  • 28
  • 69

1 Answers1

0

BeautifulSoup's find_all() will return a list of tags in the order of a DFT (depth first traversal) as per this answer here. This allows easy access to the desired elements.