
I am trying to scrape a website for game titles as well as other items, but for the sake of brevity, just game titles.

I have tried using Selenium and Beautiful Soup in tandem to grab the titles, but no matter what I do I cannot seem to get all of the September releases; in fact, I get some of the August titles as well. I think it has to do with the fact that there is no ending to the website (infinite scroll). How would I grab just the September titles? Below is the code I used; I have also tried scrolling (see the sketch after the code), but I do not think I understand how to use it properly.

EDIT: My goal is to eventually be able to get each month's releases by changing just a few lines of code.

from selenium import webdriver
from bs4 import BeautifulSoup

titles = []

# Path to the ChromeDriver executable
chromedriver = 'C:/Users/Chase The Great/Desktop/Podcast/chromedriver.exe'
driver = webdriver.Chrome(chromedriver)
driver.get('https://www.releases.com/l/Games/2019/9/')
# Grab the rendered HTML, then close the browser
res = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()
soup = BeautifulSoup(res, 'lxml')

# Collect the text of every element with the calendar-item-title class
for title in soup.find_all(class_='calendar-item-title'):
    titles.append(title.text)
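
The scrolling I tried was roughly the usual scroll-to-bottom loop, something like this (placed between driver.get(...) and the outerHTML call); I am not sure I am using it correctly:

import time

# Keep scrolling to the bottom until the page height stops growing,
# i.e. until the infinite scroll has no more batches to load.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # wait for the next batch of titles to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height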

I expect to get 133 titles, but instead I get only part of the September titles, plus some August ones, as such:

['SubaraCity', 'AER - Memories of Old', 'Vambrace: Cold Soul', 'Agent A: A Puzzle in Disguise', 'Bubsy: Paws on Fire!', 'Grand Brix Shooter', 'Legend of the Skyfish', 'Vambrace: Cold Soul', 'Obakeidoro!', 'Pokemon Masters', 'Decay of Logos', 'The Lord of the Rings: Adventure ...', 'Heave Ho', 'Newt One', 'Blair Witch', 'Bulletstorm: Duke of Switch Edition', 'The Ninja Saviors: Return of the ...', 'Re:Legend', 'Risk of Rain 2', 'Decay of Logos', 'Unlucky Seven', 'The Dark Pictures Anthology: Man ...', 'Legend of the Skyfish', 'Astral Chain', 'Torchlight II', 'Final Fantasy VIII Remastered', 'Catherine: Full Body', 'Root Letter: Last Answer', 'Children of Morta', 'Himno', 'Spyro Reignited Trilogy', 'RemiLore: Lost Girl in the Lands ...', 'Divinity: Original Sin 2 - Defini...', 'Monochrome Order', 'Throne Quest Deluxe', 'Super Kirby Clash', 'Himno', 'Post War Dreams', 'The Long Journey Home', 'Spice and Wolf VR', 'WRC 8', 'Fantasy General II', 'River City Girls', 'Headliner: NoviNews', 'Green Hell', 'Hyperforma', 'Atomicrops', 'Remothered: Tormented Fathers']
  • By "no ending to the website" you are talking about an "infinite scroll", where a new page of content is loaded when you scroll to the bottom of the screen, right? – Greg Burghardt Sep 20 '19 at 13:45
  • How far back in time do you want to get the titles? Current month and previous month? That's the challenge with this sort of web site. When do you stop? If we know that I think we can help you better. – Greg Burghardt Sep 20 '19 at 13:47
  • I just want September's content. My goal is that each month I only have to change one line of code to get the next month. – Chase Sariaslani Sep 20 '19 at 16:44
  • That is good information, but please add that to the question text instead of a comment. Other people don't necessarily read the comments, so they will miss this point. – Greg Burghardt Sep 20 '19 at 16:55

1 Answer


It seems to me that in order to get only September, you first want to grab only the section for September:

section = soup.find('section', {'class': 'Y2019-M9 calendar-sections'})

Then, once you have the September section, get all the titles, which are in <a> tags, like this:

for title in section.find_all('a', {'class': ' calendar-item-title subpage-trigg'}):
    titles.append(title.text)

Please note that none of the above has been tested.

UPDATE: The problem is that every time you load the page, it gives you only the very first section, which contains just 24 items; to access the rest you have to scroll down (infinite scroll). If you open the browser developer tools, select Network and then XHR, you will notice that every time you scroll and load the next "page" there is a request with a URL similar to this:

https://www.releases.com/calendar/nextAfter?blockIndex=139&itemIndex=23&category=Games&regionId=us

My guess is that blockIndex identifies the month and itemIndex tracks each page loaded. If you are looking only for the month of September, blockIndex will always be 139 in that request; the challenge is to get the next itemIndex so you can construct the next request. The next itemIndex is always the last itemIndex of the previous request.

I did make a script that does what you want using only BeautifulSoup. Use it at your own discretion; there are some constants that could be extracted dynamically, but I think this could give you a head start:

import json

import requests
from bs4 import BeautifulSoup

DATE_CODE = 'Y2019-M9'
LAST_ITEM_FIRST_PAGE = f'calendar-item col-xs-6 to-append first-item calendar-last-item {DATE_CODE}-None'
LAST_ITEM_PAGES = f'calendar-item col-xs-6 to-append calendar-last-item {DATE_CODE}-None'
INITIAL_LINK = 'https://www.releases.com/l/Games/2019/9/'
BLOCK = 139
titles = []


def get_next_page_link(div: BeautifulSoup):
    # The item-index of the last item on this page seeds the next request
    index = div['item-index']
    return f'https://www.releases.com/calendar/nextAfter?blockIndex={BLOCK}&itemIndex={index}&category=Games&regionId=us'


def get_content_from_requests(page_link):
    # Send a browser-like User-Agent so the site does not reject the request
    headers = requests.utils.default_headers()
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    req = requests.get(page_link, headers=headers)
    return BeautifulSoup(req.content, 'html.parser')


def scroll_pages(link: str):
    print(link)
    page = get_content_from_requests(link)
    # Collect every title in this batch that belongs to the target month
    for div in page.find_all('div', {'date-code': DATE_CODE}):
        item = div.find('a', {'class': 'calendar-item-title subpage-trigg'})
        if item:
            titles.append(item.getText())
    # Find the last item of the batch; its item-index builds the next request
    last_index_div = page.find('div', {'class': LAST_ITEM_FIRST_PAGE})
    if not last_index_div:
        last_index_div = page.find('div', {'class': LAST_ITEM_PAGES})
    if last_index_div:
        scroll_pages(get_next_page_link(last_index_div))
    else:
        print(f'Found: {len(titles)} Titles')
        print('No more pages to scroll, finishing...')


scroll_pages(INITIAL_LINK)
with open('titles.json', 'w') as outfile:
    json.dump(titles, outfile)

If your goal is to use Selenium, I think the same principle applies, unless you use its scrolling capability to load the page as you go. Replacing INITIAL_LINK, DATE_CODE and BLOCK accordingly will get you other months as well; a sketch of that parameterization follows.
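
For instance (untested, and month_constants is just an illustrative helper, not something the site provides), the constants could be derived from the year and month; the blockIndex still has to be read from the Network tab for each month, since there is no obvious way to derive it:

def month_constants(year: int, month: int, block_index: int):
    # DATE_CODE and INITIAL_LINK follow the patterns observed above;
    # block_index must be read from the XHR request in the developer tools.
    date_code = f'Y{year}-M{month}'
    initial_link = f'https://www.releases.com/l/Games/{year}/{month}/'
    return date_code, initial_link, block_index

# Example: September 2019, where the observed blockIndex was 139
DATE_CODE, INITIAL_LINK, BLOCK = month_constants(2019, 9, 139)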

  • It has improved my search by removing August, but it still only gives me a piece of September. FYI, I changed your code slightly to fit how I wrote mine, but it should do the same thing as you suggested. – Chase Sariaslani Sep 20 '19 at 00:53
  • What are you getting now? What I noticed from your question is that there is no `calendar-item-title` class; instead it is ` calendar-item-title subpage-trigg`, with a space at the beginning – Guanaco Devs Sep 20 '19 at 02:08
  • It's weird, I get the same answer whether I use `calendar-item-title` or the one you suggested. I do NOT get all of the titles. I think it is because the website does not generate the entire page at once. That's why I am using Selenium and not just Beautiful Soup. – Chase Sariaslani Sep 20 '19 at 16:42
  • What if you query for `https://www.releases.com/l/Games/2019/`, the whole page, and then trim out the section for September? That might do it. – Guanaco Devs Sep 20 '19 at 21:19
  • @ChaseSariaslani I updated the answer with a working script. If it helps you, perhaps you may consider accepting the answer and/or upvoting ;-) – Guanaco Devs Sep 21 '19 at 07:13
  • Just to see if I understand, there are 0 titles at the end because, as you say, I need to grab the data from the prior websites that get posted below? – Chase Sariaslani Sep 22 '19 at 00:57
  • Sorry, I don't follow. When we make the `driver.get('https://www.releases.com/l/Games/2019/9/')` call we only get 24 titles and no more, because you have to physically scroll the page to load the next batch of 24 titles. When you reach the end of the page, a script is triggered that loads the next batch. What I did was observe the requests triggered by the pagination to understand how the request is formed, and then made a script to get the links for the next request (scroll) dynamically. Did you test it? – Guanaco Devs Sep 22 '19 at 01:48
  • Yes, more or less what I said, but infinitely better. Thank you! – Chase Sariaslani Sep 22 '19 at 02:17