A scraper usually moves on to the next page once it has collected the data from the current one. How many pages it visits is typically controlled by a hard-coded value: if the value is 21, then 21 pages are parsed. Here is my code, which collects titles and links from a site with anime and animated series and writes them to a JSON file.

import json
import warnings

import requests
from bs4 import BeautifulSoup

# verify=False below triggers an InsecureRequestWarning on every request; silence it.
warnings.filterwarnings("ignore")

BASE_URL = 'https://hd8.4lordserials.xyz/anime-serialy'

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0'

items = []
max_page = 21
for page in range(1, max_page + 1):
    url = f'{BASE_URL}/page/{page}/' if page > 1 else BASE_URL
    print(url)

    rs = session.get(url, verify=False)
    rs.raise_for_status()

    soup = BeautifulSoup(rs.content, 'html.parser')
    for item in soup.select('.th-item'):
        title = item.select_one('.th-title').text
        url = item.a['href']
        items.append({
            'title': title,
            'url': url,
        })

with open('out.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, indent=4, ensure_ascii=False) 

The site has 21 pages at the moment. But what if it grows to 22 or 23? I don't want to update the value by hand each time. How can I make the script detect the number of pages automatically, so that it scrapes however many pages the site has without the user setting a value?

Galo Galo
  • You could keep raising the value until you hit a 404 or some other error. Or maybe you can find out if there's a next page if the current page has a "next" button – Yarin_007 Jul 07 '23 at 07:13
  • Or, normally, the actual number of pages is returned in the headers. Sometimes with a link to the next page. – Cow Jul 07 '23 at 07:15
  • Try to use a VPN. If you're from Russia. – Galo Galo Jul 07 '23 at 07:17
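
The first suggestion in the comments (keep requesting pages until one is missing) can be sketched without knowing the page count up front. This is only a minimal sketch under assumptions: `fetch_items` is a hypothetical callable, not part of the original code, that returns the list of items for a page number and raises `LookupError` when the page does not exist. `hard_limit` and the fake site data are likewise illustrative.

```python
from typing import Callable, Dict, List

Item = Dict[str, str]


def scrape_until_exhausted(
    fetch_items: Callable[[int], List[Item]],
    hard_limit: int = 1000,
) -> List[Item]:
    """Collect items page by page until a page is missing or empty."""
    collected: List[Item] = []
    # hard_limit is a safety net so a misbehaving site cannot loop forever.
    for page in range(1, hard_limit + 1):
        try:
            page_items = fetch_items(page)
        except LookupError:
            break  # the page does not exist: we are past the last page
        if not page_items:
            break  # some sites return an empty listing instead of an error
        collected.extend(page_items)
    return collected


# Stand-in for the real site (assumption): three pages with two items each.
FAKE_SITE = {
    n: [{'title': f'title {n}-{i}', 'url': f'/item-{n}-{i}.html'} for i in (1, 2)]
    for n in (1, 2, 3)
}


def fake_fetch(page: int) -> List[Item]:
    """Hypothetical fetcher: looks the page up in FAKE_SITE."""
    if page not in FAKE_SITE:
        raise LookupError(f'page {page} not found')
    return FAKE_SITE[page]


all_items = scrape_until_exhausted(fake_fetch)
print(len(all_items))
```

With `requests`, the fetcher would wrap `session.get(...)`, parse the `.th-item` entries as in the question, and translate the `requests.HTTPError` raised by `raise_for_status()` on a 404 into `LookupError`. Whether this site answers a missing page with a 404 or with an empty listing is not stated in the thread, which is why the sketch stops on either.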

1 Answer

The first page (BASE_URL) already contains the total number of pages in its pagination block. Scrape that maximum page number first, then iterate up to it to collect the data from every available page.

Here's the implementation:

import json
import requests
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings("ignore")

BASE_URL = 'https://hd8.4lordserials.xyz/anime-serialy'

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0'

items = []


def scrape_page(url):
    """Append the title and link of every entry on one listing page to items."""
    rs = session.get(url, verify=False)
    rs.raise_for_status()
    soup = BeautifulSoup(rs.content, 'html.parser')

    for item in soup.select('.th-item'):
        items.append({
            'title': item.select_one('.th-title').text,
            'url': item.a['href'],
        })


def scrape_all_pages(base_url):
    # Read the pagination block on the first page to find the last page number.
    rs = session.get(base_url, verify=False)
    rs.raise_for_status()
    soup = BeautifulSoup(rs.content, 'html.parser')
    # Assumes the last link in div.navigation holds the highest page number.
    max_page = int(soup.select('div.navigation>a')[-1].text)
    print(f"maximum pages: {max_page}")

    for page in range(1, max_page + 1):
        page_url = f'{base_url}/page/{page}/'
        print(f"page url: {page_url}")
        scrape_page(page_url)

scrape_all_pages(BASE_URL)
print(f"total items: {len(items)}")

with open('out.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, indent=4, ensure_ascii=False)

output:

maximum pages: 21
page url: https://hd8.4lordserials.xyz/anime-serialy/page/1/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/2/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/3/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/4/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/5/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/6/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/7/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/8/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/9/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/10/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/11/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/12/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/13/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/14/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/15/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/16/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/17/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/18/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/19/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/20/
page url: https://hd8.4lordserials.xyz/anime-serialy/page/21/
total items: 497

The file out.json:

[
    {
        "title": "Принесённая в жертву Принцесса и Царь зверей",
        "url": "https://hd8.4lordserials.xyz/13694-prinesyonnaya-v-zhertvu-princessa-i-car-zverei.html"
    },
    {
        "title": "Магическая битва",
        "url": "https://hd8.4lordserials.xyz/6707-magicheskaya-bitva.html"
    },
    {
        "title": "В бегах: Великая миссия",
        "url": "https://hd8.4lordserials.xyz/13702-v-begah-velikaya-missiya.html"
    },
    {
        "title": "Маги: Волшебный лабиринт",
        "url": "https://hd8.4lordserials.xyz/13796-magi-volshebnyi-labirint.html"
    },
    {
        "title": "Бессонница после школы",
        "url": "https://hd8.4lordserials.xyz/13587-bessonnica-posle-shkoly.html"
    },
    {
        "title": "История о мононокэ",
        "url": "https://hd8.4lordserials.xyz/8370-istoriya-o-mononoke.html"
    },
    {
        "title": "Йохане из паргелия: Солнечный свет в зеркале",
        "url": "https://hd8.4lordserials.xyz/13826-iohane-iz-pargeliya-solnechnyi-svet-v-zerkale.html"
    },
    {
        "title": "Боевой континент 2: Непревзойдённый клан Та",
        "url": "https://hd8.4lordserials.xyz/13824-boevoi-kontinent-2-neprevzoidyonnyi-klan-ta.html"
    },
    {
        "title": "Синий оркестр",
        "url": "https://hd8.4lordserials.xyz/13643-sinii-orkestr.html"
    },
    {
        "title": "Нулевой Эдем",
        "url": "https://hd8.4lordserials.xyz/6597-nulevoi-edem.html"
    },
    {
        "title": "Мобильный воин Гандам: Ведьма с Меркурия",
        "url": "https://hd8.4lordserials.xyz/7154-mobilnyi-voin-gandam-vedma-s-merkuriya.html"
    },
    {
        "title": "Адский рай",
        "url": "https://hd8.4lordserials.xyz/13561-adskii-rai.html"
    },
    {
        "title": "Неповторимый еженедельник боевых искусств",
        "url": "https://hd8.4lordserials.xyz/13825-nepovtorimyi-ezhenedelnik-boevyh-iskusstv.html"
    },
    {
        "title": "Магия и мускулы",
        "url": "https://hd8.4lordserials.xyz/13768-magiya-i-muskuly.html"
    },
    {
        "title": "Единорог: Вечные воины",
        "url": "https://hd8.4lordserials.xyz/13689-edinorog-vechnye-voiny-w1.html"
    },
    {
        "title": "Причина полюбить её",
        "url": "https://hd8.4lordserials.xyz/13642-prichina-polyubit-eyo-w5.html"
    },
    {
        "title": "Я получил читерские способности в другом мире и стал экстраординарным в реальном мире: История о том, как повышение уровня изменило мою жизнь",
        "url": "https://hd8.4lordserials.xyz/13697-ya-poluchil-chiterskie-sposobnosti-v-drugom-mire-i-stal-ekstraordinarnym-v-realnom-mire-istoriya-o-tom-kak-povyshenie-urovnya-izmenilo-moyu-zhizn.html"
    },
    .
    .
    .
    .
    {
        "title": "Рейтинг короля",
        "url": "https://hd8.4lordserials.xyz/6604-reiting-korolya.html"
    },
    {
        "title": "Усио и Тора",
        "url": "https://hd8.4lordserials.xyz/6603-usio-i-tora.html"
    },
    {
        "title": "Эхо террора",
        "url": "https://hd8.4lordserials.xyz/6602-eho-terrora.html"
    },
    {
        "title": "Невеста чародея: В ожидании путеводной звезды",
        "url": "https://hd8.4lordserials.xyz/6600-nevesta-charodeya-v-ozhidanii-putevodnoi-zvezdy.html"
    },
    {
        "title": "Юру Юри",
        "url": "https://hd8.4lordserials.xyz/6601-yuru-yuri.html"
    },
    {
        "title": "Ди. Грэй-мен: Святые",
        "url": "https://hd8.4lordserials.xyz/6598-di-grei-men-svyatye-w1.html"
    },
    {
        "title": "Принцессы-полудемоны",
        "url": "https://hd8.4lordserials.xyz/6599-princessy-poludemony.html"
    },
    {
        "title": "Мыши-рокеры с Марса",
        "url": "https://hd8.4lordserials.xyz/6596-myshi-rokery-s-marsa.html"
    },
    {
        "title": "Суперзлодеи",
        "url": "https://hd8.4lordserials.xyz/6595-superzlodei.html"
    },
    {
        "title": "Связанные небом",
        "url": "https://hd8.4lordserials.xyz/6593-svyazannye-nebom.html"
    },
    {
        "title": "Ди.Грэй-мен",
        "url": "https://hd8.4lordserials.xyz/6594-digrei-men.html"
    },
    {
        "title": "Механическая планета",
        "url": "https://hd8.4lordserials.xyz/6592-mehanicheskaya-planeta.html"
    },
    {
        "title": "WIXOSS: Заражённый селектор",
        "url": "https://hd8.4lordserials.xyz/6591-wixoss-zarazhyonnyi-selektor.html"
    },
    {
        "title": "Платиновый предел",
        "url": "https://hd8.4lordserials.xyz/6590-platinovyi-predel.html"
    },
    {
        "title": "Манкацу",
        "url": "https://hd8.4lordserials.xyz/6589-mankacu.html"
    },
    {
        "title": "Школа мертвецов",
        "url": "https://hd8.4lordserials.xyz/6588-shkola-mertvecov.html"
    },
    {
        "title": "Розарио + Вампир",
        "url": "https://hd8.4lordserials.xyz/6586-rozario-vampir.html"
    },
    {
        "title": "Паладин издалека",
        "url": "https://hd8.4lordserials.xyz/6587-paladin-izdaleka.html"
    },
    {
        "title": "Атака титанов: Потерянные девушки",
        "url": "https://hd8.4lordserials.xyz/6585-ataka-titanov-poteryannye-devushki-w3.html"
    },
    {
        "title": "Битвы маленьких гигантов",
        "url": "https://hd8.4lordserials.xyz/6584-bitvy-malenkih-gigantov.html"
    },
    {
        "title": "В другом мире с мужчиной, обратившимся красоткой",
        "url": "https://hd8.4lordserials.xyz/6583-v-drugom-mire-s-muzhchinoi-obrativshimsya-krasotkoi.html"
    },
    {
        "title": "Девушки на линии фронта",
        "url": "https://hd8.4lordserials.xyz/6582-devushki-na-linii-fronta.html"
    },
    {
        "title": "Истории Коёми",
        "url": "https://hd8.4lordserials.xyz/6581-istorii-koyomi.html"
    },
    {
        "title": "История цветов",
        "url": "https://hd8.4lordserials.xyz/6580-istoriya-cvetov.html"
    },
    {
        "title": "Сильнейший мудрец со слабейшей меткой",
        "url": "https://hd8.4lordserials.xyz/6578-silneishii-mudrec-so-slabeishei-metkoi.html"
    },
    {
        "title": "Ярость Бахамута: Генезис",
        "url": "https://hd8.4lordserials.xyz/6576-yarost-bahamuta-genezis.html"
    },
    {
        "title": "Саюки: Перезарядка — Зероин",
        "url": "https://hd8.4lordserials.xyz/6574-sayuki-perezaryadka-zeroin.html"
    }
]

I hope it solves your problem.
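
One caveat: `int(soup.select('div.navigation>a')[-1].text)` only works while the last link in the pagination block is a number. If the site ever puts a "next" link or an arrow last, `int()` raises `ValueError`. A slightly more defensive variant, sketched here under that assumption, takes the largest numeric label instead. The helper name `max_page_from_labels` and the sample labels are hypothetical.

```python
from typing import Iterable


def max_page_from_labels(labels: Iterable[str], default: int = 1) -> int:
    """Return the largest page number found among pagination link texts."""
    stripped = (label.strip() for label in labels)
    # Keep only labels made entirely of decimal digits; skip "Next", "...", etc.
    numbers = [int(text) for text in stripped if text.isdecimal()]
    # With no numeric labels at all, assume the listing has a single page.
    return max(numbers, default=default)


# Illustrative labels only; the real ones come from the site's pagination block.
sample_labels = ['1', '2', '3', '...', ' 21 ', 'Next']
print(max_page_from_labels(sample_labels))  # prints 21
```

In the code above it could be called as `max_page_from_labels(a.text for a in soup.select('div.navigation>a'))`.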

Ajeet Verma