I am trying to scrape earnings information for the 2020 edition of The World's Highest-Paid Athletes list through this link: https://www.forbes.com/profile/roger-federer/?list=athletes. Here is the code for the first page:

import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.forbes.com/profile/roger-federer/?list=athletes')
soup = BeautifulSoup(page.text, 'html.parser')
profile = soup.find(class_='profile-content')

# the athlete's name is the text node right after the rank heading
name = soup.find(class_='profile-heading__rank').next_sibling

# earnings figure, e.g. "$106.3M"
value = profile.find(class_='profile-info__item-value').get_text()

# stats block: age, sport, ..., citizenship
stats = profile.find_all(class_='profile-stats__text')
age = stats[0].get_text()
sport = stats[1].get_text()
citizenship = stats[5].get_text()

# profile photo URL
photo = profile.find(class_='profile-photo')
image = photo.find('img')
source = image.get('src')

print(name + " " + " " + value + " " + source + " " + age + " " + sport + " " + citizenship)

How can I get the details for the remaining 99 athletes through pagination, i.e. by following the "Next" button?

[Image showing the "Next" pagination button on the profile page]

PASCAL MAGONA
  • Extract the hyperlink from the button and download it; 'clicking' the button won't work in BeautifulSoup. – Jazib Dawre Jul 23 '20 at 09:49
  • Load the href of the a element with class "profile-nav__next". That will give you the "next" URL to get. – Nathan Champion Jul 23 '20 at 09:54
  • @NathanChampion Could you give an example of the code? – PASCAL MAGONA Jul 23 '20 at 09:59
  • I don't know Python, but this seems pretty simple with profile.find(class_ = "profile-nav__next").get("href"). Store that in a variable, then requests.get(NextPageVariable). You may need to prepend the URI, and you'll need some sort of end condition. – Nathan Champion Jul 23 '20 at 10:05 (a sketch of this approach follows below)
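
A minimal sketch of what these comments describe (the profile-nav__next class and the need to prepend https://www.forbes.com come from the discussion above; treat it as a starting point rather than a tested solution):

import requests
from bs4 import BeautifulSoup

base = 'https://www.forbes.com'
url = base + '/profile/roger-federer/?list=athletes'

while url:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # ... extract name, earnings, age, sport, citizenship and photo here ...

    # end condition: stop when the page has no "next" link
    next_link = soup.find('a', class_='profile-nav__next')
    url = base + next_link.get('href') if next_link else None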

2 Answers

This one will run and break when no more athletes are found:

import requests
from bs4 import BeautifulSoup
import csv

main = 'https://www.forbes.com'
athlete = None
while True:
    # first pass starts at Roger Federer; after that, follow the "next" link
    if athlete:
        page = requests.get(main + athlete)
    else:
        page = requests.get('https://www.forbes.com/profile/roger-federer/?list=athletes')
    soup = BeautifulSoup(page.text, 'html.parser')
    profile = soup.find(class_='profile-content')
    name = soup.find(class_='profile-heading__rank').next_sibling

    value = profile.find(class_='profile-info__item-value').get_text()

    stats = profile.find_all(class_='profile-stats__text')
    age = stats[0].get_text()
    sport = stats[1].get_text()
    citizenship = stats[5].get_text()

    photo = profile.find(class_='profile-photo')
    image = photo.find('img')
    source = image.get('src')

    print(name + " " + " " + value + " " + source + " " + age + " " + sport + " " + citizenship)

    # stop when the page has no "next" athlete link
    athlete_link = soup.find('a', class_='profile-nav__next')
    if athlete_link:
        athlete = athlete_link.get('href')
    else:
        break
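
If you also want to save the rows instead of only printing them (the csv import above is otherwise unused), a possible variation of the same loop, assuming an output file named athletes.csv, could look like this:

import csv
import requests
from bs4 import BeautifulSoup

main = 'https://www.forbes.com'
url = main + '/profile/roger-federer/?list=athletes'

# hypothetical output file; one row per athlete
with open('athletes.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'earnings', 'photo', 'age', 'sport', 'citizenship'])
    while url:
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        profile = soup.find(class_='profile-content')
        name = soup.find(class_='profile-heading__rank').next_sibling
        value = profile.find(class_='profile-info__item-value').get_text()
        stats = profile.find_all(class_='profile-stats__text')
        source = profile.find(class_='profile-photo').find('img').get('src')
        writer.writerow([name, value, source,
                         stats[0].get_text(), stats[1].get_text(), stats[5].get_text()])
        next_link = soup.find('a', class_='profile-nav__next')
        url = main + next_link.get('href') if next_link else None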
UWTD TV

Try this:

import requests
from bs4 import BeautifulSoup
import csv

main_url = 'https://www.forbes.com'
for x in range(100):
    # the list has 100 athletes; start from Roger Federer on the first pass
    if x == 0:
        url = main_url + '/profile/roger-federer/?list=athletes'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    profile = soup.find(class_='profile-content')
    name = soup.find(class_='profile-heading__rank').next_sibling

    value = profile.find(class_='profile-info__item-value').get_text()

    stats = profile.find_all(class_='profile-stats__text')
    age = stats[0].get_text()
    sport = stats[1].get_text()
    citizenship = stats[5].get_text()

    photo = profile.find(class_='profile-photo')
    image = photo.find('img')
    source = image.get('src')

    print(name + " " + " " + value + " " + source + " " + age + " " + sport + " " + citizenship)

    # follow the "next" link, or stop if there is none
    url = soup.find('a', class_='profile-nav__next')
    if url:
        url = main_url + url.get('href')
    else:
        break

Output:

Roger Federer  $106.3M https://thumbor.forbes.com/thumbor/fit-in/416x416/filters%3Aformat%28jpg%29/https%3A%2F%2Fspecials-images.forbesimg.com%2Fimageserve%2F5ed53e8fa40c3d0007ed25b3%2F0x0.jpg%3Fbackground%3D000000%26cropX1%3D509%26cropX2%3D1693%26cropY1%3D78%26cropY2%3D1262 38 Tennis Switzerland
Cristiano Ronaldo  $105M https://thumbor.forbes.com/thumbor/fit-in/416x416/filters%3Aformat%28jpg%29/https%3A%2F%2Fspecials-images.forbesimg.com%2Fimageserve%2F5ec593cc431fb70007482137%2F0x0.jpg%3Fbackground%3D000000%26cropX1%3D1321%26cropX2%3D3300%26cropY1%3D114%26cropY2%3D2093 35 Soccer Portugal
Lionel Messi  $104M https://thumbor.forbes.com/thumbor/fit-in/416x416/filters%3Aformat%28jpg%29/https%3A%2F%2Fspecials-images.forbesimg.com%2Fimageserve%2F5ec595d45f39760007b05c07%2F0x0.jpg%3Fbackground%3D000000%26cropX1%3D989%26cropX2%3D2480%26cropY1%3D74%26cropY2%3D1564 33 Soccer Argentina
...
Bhargav Desai
  • It's working but it's breaking after some time, so where can I introduce sleep? – PASCAL MAGONA Jul 23 '20 at 10:29 (see the sketch after these comments)
  • In my opinion this script has two main problems: 1. you will get an error at the end; 2. if the list had 101 entries, or any other number, you would also get problems. A much better way is to break when no more athletes are found. – UWTD TV Jul 23 '20 at 10:31
  • If you want, @PASCAL, you can try my solution. I don't think you will get errors from it. – UWTD TV Jul 23 '20 at 10:44
  • My code gets an error at the end, so you can try the other answer, @PASCAL MAGONA. I have updated my answer; now it does not give an error at the end and works for up to 100 pages. – Bhargav Desai Jul 23 '20 at 11:07
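
Regarding the sleep question in the comments above, a minimal, untested sketch of a hypothetical fetch helper: it pauses between requests with time.sleep (the 2-second delay is an arbitrary choice) and calls raise_for_status so a blocked or failed request raises an error instead of being parsed silently:

import time
import requests

def fetch(url, delay=2):
    # wait a bit between requests so the site is not hammered
    time.sleep(delay)
    page = requests.get(url)
    # fail loudly on HTTP errors (e.g. 429 or 5xx) instead of parsing an error page
    page.raise_for_status()
    return page

# usage inside either answer's loop, in place of requests.get(...):
# page = fetch(main + athlete)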