I am trying to scrape earnings information for the 2020 edition of The World's Highest-Paid Athletes list through this link: https://www.forbes.com/profile/roger-federer/?list=athletes. Here is the code for the first page:

import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.forbes.com/profile/roger-federer/?list=athletes')
soup = BeautifulSoup(page.text, 'html.parser')
profile = soup.find(class_='profile-content')

# the athlete's name is the text node right after the rank heading
name = soup.find(class_='profile-heading__rank').next_sibling

# earnings figure, e.g. "$106.3M"
value = profile.find(class_='profile-info__item-value').get_text()

# stats block: age, sport, ..., citizenship
stats = profile.find_all(class_='profile-stats__text')
age = stats[0].get_text()
sport = stats[1].get_text()
citizenship = stats[5].get_text()

# profile photo URL
photo = profile.find(class_='profile-photo')
image = photo.find('img')
source = image.get('src')

print(name + " " + " " + value + " " + source + " " + age + " " + sport + " " + citizenship)

How can I get the details for the remaining 99 athletes through pagination, i.e. by following the "Next" button?

[Image showing the "Next" pagination button on the profile page]

PASCAL MAGONA
  • Extract the hyperlink from the button and download it; 'clicking' the button won't work in BeautifulSoup. – Jazib Dawre Jul 23 '20 at 09:49
  • Load the href of the a element with class "profile-nav__next". That will give you the "next" URL to get. – Nathan Champion Jul 23 '20 at 09:54
  • @NathanChampion Could you give an example of the code? – PASCAL MAGONA Jul 23 '20 at 09:59
  • I don't know Python, but this seems pretty simple with profile.find(class_ = "profile-nav__next").get("href"). Store that in a variable, then requests.get(NextPageVariable). You may need to prepend the URI, and you'll need some sort of end condition. – Nathan Champion Jul 23 '20 at 10:05 (a sketch of this approach follows below)
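
A minimal sketch of what these comments describe (the profile-nav__next class and the need to prepend https://www.forbes.com come from the discussion above; treat it as a starting point rather than a tested solution):

import requests
from bs4 import BeautifulSoup

base = 'https://www.forbes.com'
url = base + '/profile/roger-federer/?list=athletes'

while url:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # ... extract name, earnings, age, sport, citizenship and photo here ...

    # end condition: stop when the page has no "next" link
    next_link = soup.find('a', class_='profile-nav__next')
    url = base + next_link.get('href') if next_link else None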

2 Answers

This one will run and break when no more athletes are found:

import requests
from bs4 import BeautifulSoup
import csv

main = 'https://www.forbes.com'
athlete = None
while True:
    # first pass starts at Roger Federer; after that, follow the "next" link
    if athlete:
        page = requests.get(main + athlete)
    else:
        page = requests.get('https://www.forbes.com/profile/roger-federer/?list=athletes')
    soup = BeautifulSoup(page.text, 'html.parser')
    profile = soup.find(class_='profile-content')
    name = soup.find(class_='profile-heading__rank').next_sibling

    value = profile.find(class_='profile-info__item-value').get_text()

    stats = profile.find_all(class_='profile-stats__text')
    age = stats[0].get_text()
    sport = stats[1].get_text()
    citizenship = stats[5].get_text()

    photo = profile.find(class_='profile-photo')
    image = photo.find('img')
    source = image.get('src')

    print(name + " " + " " + value + " " + source + " " + age + " " + sport + " " + citizenship)

    # stop when the page has no "next" athlete link
    athlete_link = soup.find('a', class_='profile-nav__next')
    if athlete_link:
        athlete = athlete_link.get('href')
    else:
        break
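
If you also want to save the rows instead of only printing them (the csv import above is otherwise unused), a possible variation of the same loop, assuming an output file named athletes.csv, could look like this:

import csv
import requests
from bs4 import BeautifulSoup

main = 'https://www.forbes.com'
url = main + '/profile/roger-federer/?list=athletes'

# hypothetical output file; one row per athlete
with open('athletes.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'earnings', 'photo', 'age', 'sport', 'citizenship'])
    while url:
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        profile = soup.find(class_='profile-content')
        name = soup.find(class_='profile-heading__rank').next_sibling
        value = profile.find(class_='profile-info__item-value').get_text()
        stats = profile.find_all(class_='profile-stats__text')
        source = profile.find(class_='profile-photo').find('img').get('src')
        writer.writerow([name, value, source,
                         stats[0].get_text(), stats[1].get_text(), stats[5].get_text()])
        next_link = soup.find('a', class_='profile-nav__next')
        url = main + next_link.get('href') if next_link else None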
UWTD TV

Try this:

import requests
from bs4 import BeautifulSoup
import csv

main_url = 'https://www.forbes.com'
for x in range(100):
    # the list has 100 athletes; start from Roger Federer on the first pass
    if x == 0:
        url = main_url + '/profile/roger-federer/?list=athletes'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    profile = soup.find(class_='profile-content')
    name = soup.find(class_='profile-heading__rank').next_sibling

    value = profile.find(class_='profile-info__item-value').get_text()

    stats = profile.find_all(class_='profile-stats__text')
    age = stats[0].get_text()
    sport = stats[1].get_text()
    citizenship = stats[5].get_text()

    photo = profile.find(class_='profile-photo')
    image = photo.find('img')
    source = image.get('src')

    print(name + " " + " " + value + " " + source + " " + age + " " + sport + " " + citizenship)

    # follow the "next" link, or stop if there is none
    url = soup.find('a', class_='profile-nav__next')
    if url:
        url = main_url + url.get('href')
    else:
        break

Output:

Roger Federer  $106.3M https://thumbor.forbes.com/thumbor/fit-in/416x416/filters%3Aformat%28jpg%29/https%3A%2F%2Fspecials-images.forbesimg.com%2Fimageserve%2F5ed53e8fa40c3d0007ed25b3%2F0x0.jpg%3Fbackground%3D000000%26cropX1%3D509%26cropX2%3D1693%26cropY1%3D78%26cropY2%3D1262 38 Tennis Switzerland
Cristiano Ronaldo  $105M https://thumbor.forbes.com/thumbor/fit-in/416x416/filters%3Aformat%28jpg%29/https%3A%2F%2Fspecials-images.forbesimg.com%2Fimageserve%2F5ec593cc431fb70007482137%2F0x0.jpg%3Fbackground%3D000000%26cropX1%3D1321%26cropX2%3D3300%26cropY1%3D114%26cropY2%3D2093 35 Soccer Portugal
Lionel Messi  $104M https://thumbor.forbes.com/thumbor/fit-in/416x416/filters%3Aformat%28jpg%29/https%3A%2F%2Fspecials-images.forbesimg.com%2Fimageserve%2F5ec595d45f39760007b05c07%2F0x0.jpg%3Fbackground%3D000000%26cropX1%3D989%26cropX2%3D2480%26cropY1%3D74%26cropY2%3D1564 33 Soccer Argentina
...
Bhargav Desai
  • It's working but it's breaking after some time, so where can I introduce sleep? – PASCAL MAGONA Jul 23 '20 at 10:29 (see the sketch after these comments)
  • In my opinion this script has two main problems: 1. you will get an error at the end; 2. if the list had 101 entries, or any other number, you would also get problems. A much better way is to break when no more athletes are found. – UWTD TV Jul 23 '20 at 10:31
  • If you want, @PASCAL, you can try my solution. I don't think you will get errors from it. – UWTD TV Jul 23 '20 at 10:44
  • My code gets an error at the end, so you can try the other answer, @PASCAL MAGONA. I have updated my answer; now it does not give an error at the end and works for up to 100 pages. – Bhargav Desai Jul 23 '20 at 11:07
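
Regarding the sleep question in the comments above, a minimal, untested sketch of a hypothetical fetch helper: it pauses between requests with time.sleep (the 2-second delay is an arbitrary choice) and calls raise_for_status so a blocked or failed request raises an error instead of being parsed silently:

import time
import requests

def fetch(url, delay=2):
    # wait a bit between requests so the site is not hammered
    time.sleep(delay)
    page = requests.get(url)
    # fail loudly on HTTP errors (e.g. 429 or 5xx) instead of parsing an error page
    page.raise_for_status()
    return page

# usage inside either answer's loop, in place of requests.get(...):
# page = fetch(main + athlete)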