I am trying to scrape the Reuters website for all the news headlines related to the Middle East. Link to the webpage: https://www.reuters.com/subjects/middle-east
This page automatically loads earlier headlines as I scroll down, but when I look at the page source, it contains only the latest 20 headline links.
I looked for a "next" or "previous" hyperlink, which is usually present in such cases, but unfortunately there isn't one on this page.

import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.reuters.com/subjects/middle-east'
result = requests.get(url)
content = result.content
soup = BeautifulSoup(content, 'html.parser')

# Get all the article links in the page source
links = []
for hl in soup.find_all('a', href=True):  # skip <a> tags that have no href
    if re.search('article', hl['href']):
        links.append(hl['href'])

# The first link is the page itself and so we skip it
links = links[1:]

# The urls are repeated and so we only keep the unique instances
urls = []
for link in links:
    if link not in urls:
        urls.append(link)

# The number of urls is limited to 20 (THE PROBLEM!)
print(len(urls))

I have very limited experience with all of this, but my best guess is that the page uses JavaScript (or whatever language it is written in) to load the earlier results when I scroll down, and that this is what I need to reproduce using some Python module.
The code goes further to extract other details from each of these links but that is irrelevant to the posted problem.
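
To illustrate what I think I need, here is a rough sketch of the loop I have in mind, assuming the page fetches older headlines in numbered batches. Everything about the batches is an assumption: `fetch_page` is a hypothetical callable that returns the HTML for one batch, and `fake_fetch` is only a stand-in with made-up links so the sketch runs without the network. I have not found the real request that the page makes. The sketch uses the standard library's `html.parser` instead of BeautifulSoup just to stay self-contained.

```python
from html.parser import HTMLParser


class ArticleLinkParser(HTMLParser):
    """Collects href values of <a> tags whose URL contains 'article'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and 'article' in href:
            self.links.append(href)


def collect_article_links(fetch_page, max_pages=5):
    """Call fetch_page(1), fetch_page(2), ... and gather unique article links.

    fetch_page is a hypothetical callable returning the HTML of one batch
    of headlines. The loop stops early when a batch adds nothing new.
    """
    seen = {}  # dict keys keep insertion order and drop duplicates
    for page in range(1, max_pages + 1):
        parser = ArticleLinkParser()
        parser.feed(fetch_page(page))
        new_links = [link for link in parser.links if link not in seen]
        if not new_links:
            break
        seen.update(dict.fromkeys(new_links))
    return list(seen)


def fake_fetch(page):
    """Stand-in for the real network call; the HTML here is made up."""
    pages = {
        1: '<a href="/article/a1">A</a><a href="/subjects/x">X</a>',
        2: '<a href="/article/a2">B</a><a href="/article/a1">A</a>',
    }
    return pages.get(page, '')


links = collect_article_links(fake_fetch)
print(links)
```

If such a batched request does exist, `fetch_page` could presumably wrap `requests.get` with whatever parameter selects the batch, but I don't know what that request looks like for this page.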