So if I use await page.waitFor(9000) or some other hard-coded wait, my function waits until the page has loaded.

However, with just await page.goto(url, {'waitUntil': 'networkidle0'}), the function keeps running before the entire page has loaded, so the script fails.

Here is the entire code:

import requests
from bs4 import BeautifulSoup
import time
import os
import pyppeteer
from pyppeteer import launch
import asyncio
import subprocess



AGENT_DIR = os.path.dirname(__file__) + r'\data\agents'
SAVE_FILE = os.path.join(AGENT_DIR, 'latest.txt')
URL = 'https://techblog.willshouse.com/2012/01/03/most-common-user-agents/'


def get_latest_agents():

    ''' We are getting the most
        common latest user agents
        from the {URL} site
        and then saving them to the text file {SAVE_FILE}
    '''

    async def scrape():

        url = URL
        browser = await launch(headless = False)
        page = await browser.newPage()
        await page.goto(url, {'waitUntil': 'networkidle0'})
        await page.waitFor(9000)
        
        content = await page.content()
        
        soup = BeautifulSoup(content, 'html.parser')
    
        agents = soup.select('.get-the-list')[0].text
        #agents = agents.split('\n')

        print(agents)
        
        await browser.close()

    loop = asyncio.get_event_loop()
    response = loop.run_until_complete(scrape())

if __name__ == '__main__':

    # first kill all chrome.exe as pyppeteer doesn't close properly
    subprocess.call(['taskkill', '/F', '/im', 'chrome.exe'])
    get_latest_agents()

Thank you.

MasayoMusic

1 Answer


The code here is overcomplicated. Pyppeteer already has selectors, so there's no need for BeautifulSoup, requests, or the other unused libs/variables that might be adding to the confusion.

BS is a static HTML parser that is typically used with requests, whereas Pyppeteer is a driver that works with the browser in real-time. The only reason to use BS is if all of the data is available statically, in which case there's no need for Pyppeteer.
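For illustration, if the list did turn out to be present in the raw HTML (an assumption worth verifying against the page source), the whole job could be done with requests and BS alone, no browser needed:

import requests
from bs4 import BeautifulSoup

URL = "https://techblog.willshouse.com/2012/01/03/most-common-user-agents/"


def scrape_static():
    # only valid if .get-the-list is served in the static HTML
    html = requests.get(URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(".get-the-list").text


if __name__ == "__main__":
    print(scrape_static())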

Pyppeteer offers page.waitForSelector, which does exactly what you need: it blocks until the element matching your selector is ready. Once it is, you can extract the value with page.Jeval or a similar function that runs code in the browser console.

"networkidle2" can only slow you down since waitForSelector may well find the data you need well before only 2 network requests are outstanding.

Here's a simple example:

import asyncio
from pyppeteer import launch


URL = "https://techblog.willshouse.com/2012/01/03/most-common-user-agents/"


async def scrape():
    browser = await launch(headless=False)
    page, = await browser.pages()  # reuse the tab opened at launch
    await page.goto(URL, {"waitUntil": "domcontentloaded"})
    # block until the element is present, then read its value in the browser
    await page.waitForSelector(".get-the-list", timeout=1e5)
    agents = await page.Jeval(".get-the-list", "e => e.value")
    await browser.close()
    return agents


if __name__ == "__main__":
    print(asyncio.run(scrape()))
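
"domcontentloaded" makes goto return as soon as the initial HTML is parsed; the real waiting is handled by waitForSelector, which resolves the moment the element appears rather than after a fixed 9-second sleep or a network-idle heuristic.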
ggorlen