I want to extract the title, subtitle, and link of each article on a website, but I have run into a problem. After about 30-35 titles, some entries contain two spans, which causes a conflict and prevents me from getting any more titles. How can I solve this issue?
Code:
import time
import pandas
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
web = "https://www.thesun.co.uk/sport/football/"
driver.get(web)
containers = driver.find_elements(by="xpath", value='//div[@class="teaser__copy-container"]')
titles = []
subtitles = []
links = []
for container in containers:
    title = container.find_element(by='xpath', value='./a/span').text
    subtitle = container.find_element(by='xpath', value='./a/h3').text
    link = container.find_element(by='xpath', value='./a').get_attribute('href')
    titles.append(title)
    subtitles.append(subtitle)
    links.append(link)
my_dict = {'title': titles, 'subtitle': subtitles, 'link': links}
fd = pandas.DataFrame(my_dict)
fd.to_csv('myextract.csv')
Screenshot of two spans after 30-35 titles:
I tried selecting the span by its class instead of by the span tag itself, but that made my program crash with an error.
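
For reference, here is a rough sketch of the kind of workaround I have been considering. It assumes (I have not confirmed this) that the crash comes from some containers not matching the expected `./a/span` or `./a/h3` structure, so that `find_element` raises an exception. The helper names `first_text`, `first_attribute`, and `extract_rows` are my own placeholders, not part of Selenium. The idea is to use `find_elements`, which returns an empty list when nothing matches, and to take only the first match when there are several.

```python
def first_text(container, xpath, default=""):
    """Return the text of the first element matching xpath, or default if none match."""
    # find_elements returns an empty list instead of raising when nothing matches
    matches = container.find_elements(by="xpath", value=xpath)
    return matches[0].text if matches else default


def first_attribute(container, xpath, name, default=""):
    """Return an attribute of the first element matching xpath, or default."""
    matches = container.find_elements(by="xpath", value=xpath)
    if not matches:
        return default
    value = matches[0].get_attribute(name)
    return value if value is not None else default


def extract_rows(containers):
    """Build one dict per container; missing pieces become empty strings."""
    rows = []
    for container in containers:
        rows.append({
            # if a container holds two spans, only the first one is used
            "title": first_text(container, "./a/span"),
            "subtitle": first_text(container, "./a/h3"),
            "link": first_attribute(container, "./a", "href"),
        })
    return rows
```

With this, the loop in my script would be replaced by `pandas.DataFrame(extract_rows(containers)).to_csv('myextract.csv')`. Whether taking the first span is the right choice for the entries with two spans is exactly what I am unsure about.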