
Hi, I have Excel files containing lists of YouTube URLs (thousands of URLs across 3 Excel files) and I'm trying to get the title of each video. I tried doing it with Python, but it turned out to be too slow because I had to put a sleep command on the HTML render. The code looks like this:

import xlrd
import time
from bs4 import BeautifulSoup
import requests
from xlutils.copy import copy
from requests_html import HTMLSession


loc = "testt.xls"

wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
wb2 = copy(wb)                      # writable copy of the workbook
sheet.cell_value(0, 0)

for i in range(3, sheet.nrows):
    ytlink = sheet.cell_value(i, 0)
    session = HTMLSession()
    response = session.get(ytlink)
    response.html.render(sleep=3)   # rendering errors out without the sleep
    print(sheet.cell_value(i, 0))
    print(ytlink)
    element = BeautifulSoup(response.html.html, "lxml")
    media = element.select_one('#container > h1').text   # the video title
    print(media)
    s2 = wb2.get_sheet(0)
    s2.write(i, 0, media)           # overwrite the URL cell with the title
    wb2.save("testt.xls")

Is there any way to make this faster? I tried Selenium, but that seemed even slower. With html.render I apparently need the sleep timer, otherwise it throws an error; I tried lower sleep values, but it still errors out after a while with those. Any help please, thanks :)

PS: the prints are just there so I can check the output; they aren't important to the actual usage.

2 Answers


With your current method (or Selenium) you are rendering the actual web page, which you don't need to do. I recommend using a Python library that will handle it for you. Below is an example using YoutubeDL (from the youtube_dl package):

from youtube_dl import YoutubeDL

with YoutubeDL() as ydl:
    title = ydl.extract_info("https://www.youtube.com/watch?v=jNQXAC9IVRw", download=False).get("title", None)
    print(title)

Note that doing 1000 of these requests, with the rate limits imposed by YouTube, will still be slow. If you are planning on doing possibly thousands of requests in the future I recommend looking into getting an API key.
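For reference, here is a minimal sketch of the API route, assuming the YouTube Data API v3 videos endpoint with a key created in the Google Cloud console (YOUR_API_KEY and the helper name fetch_titles are placeholders, not from the answer); the endpoint accepts up to 50 comma-separated video IDs per call:

import requests

API_KEY = "YOUR_API_KEY"   # placeholder: create a key in the Google Cloud console

def fetch_titles(video_ids):
    # one API call can look up the snippets of up to 50 video IDs
    resp = requests.get(
        "https://www.googleapis.com/youtube/v3/videos",
        params={"part": "snippet", "id": ",".join(video_ids), "key": API_KEY},
    )
    resp.raise_for_status()
    return {item["id"]: item["snippet"]["title"] for item in resp.json()["items"]}

# batch a long list of IDs in chunks of 50
video_ids = ["jNQXAC9IVRw", "z9eoubnO-pE"]
titles = {}
for i in range(0, len(video_ids), 50):
    titles.update(fetch_titles(video_ids[i:i + 50]))
print(titles)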

Saddy
  • Oh, I saw this, but I thought I could only use it with an API token key. I guess rate limits can slow it down, but it would still be faster than my previous method, right? As long as it doesn't time out or fail in the middle of the Excel file and make me start over, I'd be fine with it being a bit slow. It seems there's no other way without the API or rendering. Thanks for the help, I will try that :) – kintama shintoki Jun 05 '21 at 23:51

You can do 1000 requests in less than a minute using async requests-html like this:

import random
from time import perf_counter
from requests_html import AsyncHTMLSession

urls = ['https://www.youtube.com/watch?v=z9eoubnO-pE'] * 1000

asession = AsyncHTMLSession()
start = perf_counter()

async def fetch(url):
    # the CONSENT cookie avoids YouTube's cookie-consent redirect (see note below)
    r = await asession.get(url, cookies={'CONSENT': 'YES+cb.20210328-17-p0.en-GB+FX+{}'.format(random.randint(100, 999))})
    return r

# the default argument binds each url to its own callable for asession.run
all_responses = asession.run(*[lambda url=url: fetch(url) for url in urls])
all_titles = [r.html.find('title', first=True).text for r in all_responses]

print(all_titles)
print(perf_counter() - start)

Done in 55s on my laptop.

Note that you need to pass cookies={'CONSENT': 'YES+cb.20210328-17-p0.en-GB+FX+{}'.format(random.randint(100, 999))} with the request, otherwise YouTube redirects to its cookie-consent page instead of the video.

  • Tried it and it worked faster than before, in a for loop that gets each URL from the Excel rows of that column and replaces it with the title of that YouTube video, thanks :) Code is below to inform newbies like myself: – kintama shintoki Jun 06 '21 at 14:58
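(The commenter's code isn't included above; purely as an illustrative sketch of how the async approach might be wired to the question's spreadsheet, assuming the same testt.xls layout and reusing xlrd/xlutils from the question, something along these lines could work:)

import xlrd
from xlutils.copy import copy
from requests_html import AsyncHTMLSession

wb = xlrd.open_workbook("testt.xls")    # file name and layout assumed from the question
sheet = wb.sheet_by_index(0)
wb2 = copy(wb)

rows = list(range(3, sheet.nrows))
urls = [sheet.cell_value(i, 0) for i in rows]

asession = AsyncHTMLSession()

async def fetch(row, url):
    r = await asession.get(url, cookies={'CONSENT': 'YES+cb.20210328-17-p0.en-GB+FX+999'})
    return row, r.html.find('title', first=True).text

# keep each row number with its title so we don't rely on response ordering
results = asession.run(*[lambda row=row, url=url: fetch(row, url) for row, url in zip(rows, urls)])

out = wb2.get_sheet(0)
for row, title in results:
    out.write(row, 0, title)            # replace each URL cell with its title
wb2.save("testt.xls")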