I am trying to scrape the text of webpages that is contained in title, heading, or paragraph tags. When I run the code below I get mixed results depending on where the URL is from. For some sources (e.g. Wikipedia or Reuters) the code works more or less fine and at least finds all the text. For other sources (e.g. Politico, The Economist) I miss a lot of the text contained in the webpage.
I am using a traversal algorithm to walk through the tree and check whether each tag is 'of interest'. Maybe find_all(True, recursive=False) is for some reason missing children that subsequently contain the text I am looking for? I'm unsure how to investigate that. Or maybe some sites are blocking the scraping somehow? But then why can I scrape one paragraph from The Economist?
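For what it's worth, one way to check whether the tree builder itself is dropping nodes is to parse the same markup with several parsers and compare the tags each one produces (a sketch on a small inline sample; it assumes lxml and html5lib may or may not be installed, so unavailable parsers are skipped):

```python
from bs4 import BeautifulSoup

# Sample with unclosed <p> tags, similar to what some real pages emit.
html = "<div><p>one<p>two</div>"

counts = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(html, features=parser)
        # Record every tag name the parser produced, in document order.
        counts[parser] = [t.name for t in soup.find_all(True)]
    except Exception:
        counts[parser] = None  # parser not installed
print(counts)
```

If the lists differ noticeably on a page that misbehaves, the parser (rather than the traversal) is a likely suspect.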
The code below replicates the issue for me: the Wikipedia page (urls[3]) prints as desired, the Politico page (urls[0]) is missing all of the article text, and the Economist page (urls[1]) is missing all but one paragraph.
from bs4 import BeautifulSoup
import requests
urls = ["https://www.politico.com/news/2022/01/17/democrats-biden-clean-energy-527175",
"https://www.economist.com/finance-and-economics/the-race-to-power-the-defi-ecosystem-is-on/21807229",
"https://www.reuters.com/world/significant-damage-reported-tongas-main-island-after-volcanic-eruption-2022-01-17/",
"https://en.wikipedia.org/wiki/World_War_II"]
# get soup
url = urls[0] # first two urls don't work, last two do work
response = requests.get(url)
soup = BeautifulSoup(response.text, features="html.parser")
# tags with text that i want to print
tags_of_interest = ['p', 'title'] + ['h' + str(i) for i in range(1, 7)]
def read(soup):
    for tag in soup.find_all(True, recursive=False):
        if tag.name in tags_of_interest:
            print(tag.name + ": ", tag.text.strip())
        for child in tag.find_all(True, recursive=False):
            read(child)
# call the function
read(soup)
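For comparison, here is a flat version of the same extraction that skips the manual traversal entirely: find_all accepts a list of tag names and searches the whole tree, so nesting depth no longer matters. This is a sketch on a small inline sample rather than the live URLs (which may change); if this version finds text the recursive one misses, the traversal is at fault rather than the site:

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Sample</title></head>
<body><main><article><h2>Heading</h2><p>Paragraph text.</p></article></main></body>
</html>
"""

tags_of_interest = ['p', 'title'] + ['h' + str(i) for i in range(1, 7)]
soup = BeautifulSoup(html, features="html.parser")

# find_all with a list of names matches any of them, across the whole tree.
found = [(t.name, t.text.strip()) for t in soup.find_all(tags_of_interest)]
for name, text in found:
    print(name + ":", text)
```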