I'm trying to use BS4 to parse the HTML of a YouTube channel's About page so I can scrape the number of channel views. Below is the code to scrape the channel views (located in a 'yt-formatted-string' element) and also the whole right column of the page. The two calls return an empty list and a "None" value for find_all() and find(), respectively.

I read another thread saying I may be receiving an empty list or "None" value because the page calls an API to fetch the total channel view count, so the values aren't actually in the HTML I'm parsing.

I know I could access much of this info through the YouTube API, but I want to run this code over multiple channels that are not my own. Moreover, I want to understand how to use BS4 to its full extent so I can replicate this process on an Instagram page or Facebook page.

Should I be using a different library that isn't BS4? Is what I'm looking to accomplish even possible?

My code:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

#find Youtube channel views and subscriber counts

my_url = 'https://www.youtube.com/c/Rozziofficial/about'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html,"html.parser")

body = page_soup.body
views_count = body.find_all('yt-formatted-string',{"class":"style-scope ytd-channel-about-metadata-renderer"})
right_column = body.find('div', {"id":"right-column"})

print(right_column)
print(views_count)
MendelG

1 Answer


YouTube loads its content dynamically with JavaScript, so the HTML that urllib downloads does not contain the rendered elements you are searching for. However, the data is available in JSON format inside the page source. You can convert this data to a Python dictionary (dict) using the built-in json library.
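To illustrate the idea without a network request, here is a minimal offline sketch. The SAMPLE_HTML string and its "stats" key are made up for illustration and are far simpler than what the real page embeds; the helper name extract_initial_data is also hypothetical.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for the HTML that YouTube returns.
# The real embedded object is much larger; this structure is an assumption.
SAMPLE_HTML = """
<html><body>
<div id="right-column"></div>
<script>var ytInitialData = {"stats": {"viewCountText": {"simpleText": "10,263,762 views"}}};</script>
</body></html>
"""


def extract_initial_data(html):
    """Return the embedded ytInitialData object as a dict, or None if absent."""
    match = re.search(r"var ytInitialData = ({.*});", html)
    if match is None:
        # Nothing matched, so there is no JSON to parse
        return None
    return json.loads(match.group(1))


initial_data = extract_initial_data(SAMPLE_HTML)
print(initial_data["stats"]["viewCountText"]["simpleText"])

# A page without the variable yields None instead of raising AttributeError
print(extract_initial_data("<html></html>"))
```

Returning None when the pattern is missing lets the caller decide how to handle a page that did not contain the expected script.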

The full example below uses the URL you provided: https://www.youtube.com/c/Rozziofficial/about. You can change the channel name, and it should work the same way for other channels.

Here's an example using requests; you can use urllib instead:

import re
import json
import requests
from bs4 import BeautifulSoup

URL = "https://www.youtube.com/c/Rozziofficial/about"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

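# We locate the JSON data using a regular-expression pattern
match = re.search(r"var ytInitialData = ({.*});", str(soup))
if match is None:
    # The page layout may have changed, or a different page was returned
    raise RuntimeError("ytInitialData was not found in the page source")
data = match.group(1)

# This converts the JSON data to a python dictionary (dict)
json_data = json.loads(data)

# Uncomment to view all the data, pretty-printed
# print(json.dumps(json_data, indent=2))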

# This is the info from the webpage on the right-side under "stats", it contains the data you want
stats = json_data["contents"]["twoColumnBrowseResultsRenderer"]["tabs"][5]["tabRenderer"]["content"]["sectionListRenderer"]["contents"][0]["itemSectionRenderer"]["contents"][0]["channelAboutFullMetadataRenderer"]

print("Channel Views:", stats["viewCountText"]["simpleText"])
print("Joined:", stats["joinedDateText"]["runs"][1]["text"])

Output:

Channel Views: 10,263,762 views
Joined: Jun 30, 2007
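The hard-coded index [5] selects one tab from the "tabs" list, so it depends on how many tabs a channel has. As a hedged alternative, the sketch below searches the nested data for the "channelAboutFullMetadataRenderer" key instead of relying on a fixed position. The find_key helper and the shortened sample dictionary (including its "title" entries) are hypothetical and only mimic the shape used above.

```python
import json


def find_key(node, key):
    """Depth-first search for the first value stored under `key`; None if missing."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


# Hypothetical, shortened stand-in for json_data (an assumption, not real output)
sample = {
    "contents": {
        "twoColumnBrowseResultsRenderer": {
            "tabs": [
                {"tabRenderer": {"title": "Home"}},
                {"tabRenderer": {"title": "About", "content": {
                    "channelAboutFullMetadataRenderer": {
                        "viewCountText": {"simpleText": "10,263,762 views"},
                        "joinedDateText": {"runs": [{"text": "Joined "},
                                                    {"text": "Jun 30, 2007"}]},
                    }
                }}},
            ]
        }
    }
}

stats = find_key(sample, "channelAboutFullMetadataRenderer")
print("Channel Views:", stats["viewCountText"]["simpleText"])
print("Joined:", stats["joinedDateText"]["runs"][1]["text"])

# Indented output makes the nesting easier to read than the raw dictionary
print(json.dumps(stats, indent=2))
```

With the real data you would pass json_data instead of sample. The search stops at the first match, so it assumes the key appears only once in the structure.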


MendelG
  • When you are indexing the variable 'stats', what is the [5] for? I understand how indexing works in dictionaries and lists, but what are you specifically using the numbered indexing for? Also, what is the best way to view the JSON data or the JSON data as a dictionary? I want to pull out subscriber count for the channel as well, but the JSON data (both in dictionary form and normal JSON form) are poorly organized in Spyder. Is there a way to look at it all organized, similar to a beautifier so that I can read and understand the dictionaries and JSON data? – Prithvi Venkataswamy Jun 15 '21 at 21:29
  • @MendelG, your code raises an error, `AttributeError: 'NoneType' object has no attribute 'group'`, on the line `data = re.search(r"var ytInitialData = ({.*});", str(soup)).group(1)` – Hermess Mar 23 '22 at 20:29