I want to extract the titles, subtitles and links from a website, but I've run into a problem. After about 30-35 titles, there are two spans that cause a conflict and prevent me from getting any more titles. How can I solve this issue?

Code:

import time
import pandas
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

web = "https://www.thesun.co.uk/sport/football/"

driver.get(web)

containers = driver.find_elements(by="xpath", value='//div[@class="teaser__copy-container"]')

titles = []
subtitles = []
links = []

for container in containers:
    title = container.find_element(by='xpath', value='./a/span').text
    subtitle = container.find_element(by='xpath', value='./a/h3').text
    link = container.find_element(by='xpath', value='./a').get_attribute('href')

    titles.append(title)
    subtitles.append(subtitle)
    links.append(link)

my_dict = {'title': titles, 'subtitle': subtitles, 'link': links}
fd = pandas.DataFrame(my_dict)
fd.to_csv('myextract.csv')


[Screenshot: two spans appearing after 30-35 titles]

I attempted to extract the class of span instead of the span itself, but it made my program crash and display an error.

Black cat

2 Answers

I couldn't reproduce your problem, but it looks like the span you need is always the last span.

So you can get the list of span elements and take the text from the last one:

    title = container.find_elements(by='xpath', value='./a/span')[-1].text
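
A slightly more defensive version of the same idea, as a sketch only. `last_span_text` is a hypothetical helper name, and it assumes `container` is a Selenium WebElement (or anything else exposing `find_elements`). It returns an empty string when a container has no span at all, so one unusual teaser does not stop the loop.

```python
def last_span_text(container):
    """Return the text of the last <span> directly under the container's <a>.

    Assumes `container` behaves like a Selenium WebElement; the helper
    name is illustrative and not part of Selenium.
    """
    # find_elements (plural) returns a list, which may be empty
    spans = container.find_elements(by="xpath", value="./a/span")
    if not spans:
        return ""
    # When there are two spans, the title is assumed to be the last one
    return spans[-1].text
```

In the question's loop this would replace the `find_element` call: `title = last_span_text(container)`.
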
Yaroslavm
  • It didn't work. It's now showing a TypeError: 'WebElement' object is not subscriptable. – codejerry08 Aug 28 '23 at 12:09
  • How about `title = container.find_elements(by='xpath', value='./a//span')[-1].get_attribute('textContent')`? – Yaroslavm Aug 28 '23 at 12:11
  • @codejerry08, make sure you are calling "find_elements" and not "find_element". The error suggests you are using the latter. – pcalkins Aug 28 '23 at 22:23
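
The suggestions in these comments can be combined into one sketch. It relies on `find_elements` returning a list (which is why indexing with `[-1]` works, unlike on the single WebElement returned by `find_element`) and falls back to the `textContent` attribute when `.text` is empty, which can happen when the element is not rendered as visible. `span_text_or_content` is a hypothetical helper name, not a Selenium function.

```python
def span_text_or_content(container):
    """Text of the last <span> under the link, with a textContent fallback.

    Assumes `container` behaves like a Selenium WebElement.
    """
    # `./a//span` matches spans at any depth under the link
    spans = container.find_elements(by="xpath", value="./a//span")
    if not spans:
        return ""
    last = spans[-1]  # valid because find_elements returns a list
    # .text holds only the rendered text; textContent is the raw DOM text
    return last.text or (last.get_attribute("textContent") or "").strip()
```
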
Assuming there are no other span tags on that page with the class t-p-fill before the one you want, then 'span.t-p-fill' or 'span[class*="t-p-fil"]' (using by='css selector') should be a unique identifier for it.
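
As a rough sketch, the selector could be used inside the question's loop like this. The class name `t-p-fill` is taken from this answer and has not been checked against the live page, and `title_by_class` is a hypothetical helper name.

```python
def title_by_class(container, selector="span.t-p-fill"):
    """Return the text of the first element matching the CSS selector.

    Assumes `container` behaves like a Selenium WebElement.
    """
    # "css selector" is the string value behind By.CSS_SELECTOR
    matches = container.find_elements(by="css selector", value=selector)
    # Return an empty title instead of raising when nothing matches,
    # so the titles, subtitles and links lists stay the same length
    return matches[0].text if matches else ""
```

Calling `title_by_class(container)` in place of the original `find_element` line keeps one entry per container even when a teaser has no matching span.
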

Michael Mintz
  • Thanks a lot, my friend. I was on this for nearly two days straight. At this point, it's successfully pulling in all the titles, although there are still a few absent titles within the middle section. But, there's a notable improvement compared to the previous state, as it's managed to scrape approximately 90% of the titles from the website. Yet, if you take a glance at the image provided above, you'll notice that both span classes commence with the same initial two letters. It's only after that point, within the middle space, that they diverge and become distinct. – codejerry08 Aug 29 '23 at 17:13