So I am trying to scrape games off my Steam wishlist using BeautifulSoup. Ideally, I would like the name of each game, the link to its Steam store page, and the currently listed price. The issue is that when I call soup.find_all("div", {"class": "wishlist_row"}) it returns an empty list, even though I can see in the inspector that there should be one of these divs for each game on my wishlist. Here is a condensed version of my current code:

from bs4 import BeautifulSoup
import requests

profile_id = "id/Zorro4"

url_base = "https://store.steampowered.com/wishlist/"

header = {"User-Agent": "Mozilla/5.0"}  # defined here so the snippet runs

r = requests.get(url_base + profile_id + "#sort=order", headers=header)

data = r.text

soup = BeautifulSoup(data, features="lxml")

# find divs containing information about game and steam price
divs = soup.find_all("div", {"class": "wishlist_row"})

print(divs)
>>> []

I can clearly see these divs in the inspector if I go to https://store.steampowered.com/wishlist/id/zorro4/#sort=order

I have noticed something odd that might help solve the problem but I am not sure what to make of it.

soup.find(id="wishlist_ctn") # The div which should contain all the wishlist_row divs
>>> <div id="wishlist_ctn">\n</div> 

This, as far as I know, should return <div id="wishlist_ctn">...</div>, since the div contains more nested divs (the ones I'm looking for). I am not sure why it returns just a newline character. It's almost as though the contents of the wishlist_ctn div get lost during scraping. Any help would be super appreciated; I've been trying to solve this for the last couple of days with no success.

Jurij

2 Answers


The data you see on the webpage is loaded dynamically via JavaScript. The URL the data is loaded from is embedded in the HTML page, so we can use the re module to extract it.

This example prints the JSON data of the wishlist:

import re
import json
import requests

url = 'https://store.steampowered.com/wishlist/id/zorro4/#sort=order'
wishlist_url = json.loads(re.findall(r'g_strWishlistBaseURL = (".*?");', requests.get(url).text)[0])

data = requests.get(wishlist_url + 'wishlistdata/?p=0').json()
print(json.dumps(data, indent=4))

Prints:

{
    "50": {
        "name": "Half-Life: Opposing Force",
        "capsule": "https://steamcdn-a.akamaihd.net/steam/apps/50/header_292x136.jpg?t=1571756577",
        "review_score": 8,
        "review_desc": "Very Positive",
        "reviews_total": "5,383",
        "reviews_percent": 95,
        "release_date": "941443200",
        "release_string": "1 Nov, 1999",
        "platform_icons": "<span class=\"platform_img win\"></span><span class=\"platform_img mac\"></span><span class=\"platform_img linux\"></span>",
        "subs": [
            {
                "id": 32,

...and so on.
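Building on this, a minimal sketch of walking the paginated endpoint (`?p=0`, `?p=1`, …) and extracting the name, store link, and price the question asked for. The helper names here are hypothetical; only the `name` and `subs` fields are confirmed by the output above, and the assumption that a `price` field sits inside each `subs` entry may not hold:

```python
def parse_wishlist_page(page_data):
    """Turn one page of wishlist JSON into (name, store_url, price) tuples.

    The dict keys are appids, so the store link can be rebuilt from them.
    That a "price" field lives inside each "subs" entry is an assumption.
    """
    games = []
    for appid, info in page_data.items():
        subs = info.get("subs", [])
        price = subs[0].get("price") if subs else None
        games.append((info["name"],
                      f"https://store.steampowered.com/app/{appid}/",
                      price))
    return games


def fetch_wishlist(steamid64):
    """Page through the endpoint, incrementing p until a page comes back empty.

    The exact stop condition (an empty or non-dict response) is an assumption.
    """
    import requests  # imported here so the parser above stays dependency-free

    games, p = [], 0
    while True:
        url = (f"https://store.steampowered.com/wishlist/profiles/"
               f"{steamid64}/wishlistdata/?p={p}")
        page = requests.get(url).json()
        if not isinstance(page, dict) or not page:
            break
        games.extend(parse_wishlist_page(page))
        p += 1
    return games
```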
Andrej Kesely
  • You should increment `p` and loop as long as you receive games. While your code might work fine for this specific steamid, it won't for others, as the response from page 0 will only contain the first 100 games. – Jordan Brière Dec 21 '19 at 21:10
  • Thank you! Any way to navigate the output using the json library, or do I have to use .split() and parse it all manually? – Jurij Dec 21 '19 at 22:36
  • @Jurij No, the `data` variable is of type `dict`, so you work with it like a normal python dictionary. – Andrej Kesely Dec 21 '19 at 22:37

The issue is that the wishlist is actually populated by an AJAX request. Beautiful Soup does not handle that; you would need a web driver for dynamically loaded content. Luckily, the shortcut here is to use the API call the page itself makes for the wishlist and parse the JSON response. In this instance that request is:

https://store.steampowered.com/wishlist/profiles/76561198068616380/wishlistdata/?p=0
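A small sketch of rebuilding that endpoint for any user, assuming `requests` is available; `wishlist_data_url` is a hypothetical helper, and the string of digits is kept as the example from the URL above:

```python
def wishlist_data_url(steamid64, page=0):
    """Rebuild the wishlistdata endpoint seen in the browser's network tab."""
    return (f"https://store.steampowered.com/wishlist/profiles/"
            f"{steamid64}/wishlistdata/?p={page}")


# e.g. data = requests.get(wishlist_data_url("76561198068616380")).json()
```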

Robert Brisita
  • Thanks so much! Could you please elaborate on how you got to that link / What is the general process to get to such a link for these cases where data is dynamically loaded via JSON? – Jurij Dec 21 '19 at 22:30
  • The Developer Tools in the browser will show all requests coming from the loaded page. Then you can filter by type: images, css, js, etc. I'm assuming here, but the string of digits is probably a user id or something similar; you would change that for each user. – Robert Brisita Dec 22 '19 at 01:11