
I'm trying to do some web scraping, as a learning exercise, using a predefined number of workers.

I'm using None as a sentinel to break out of the while loop and stop the worker.

The speed of each worker varies, and all workers are closed before the last url is passed to gather_search_links to get the links.

I tried to use asyncio.Queue, but I had less control than with deque.
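
For context, a minimal sketch of how one producer/consumer pair would look with asyncio.Queue and the same None sentinel (simplified, not my actual attempt):

# Minimal sketch, not my actual attempt: one producer/consumer pair wired
# through asyncio.Queue, using None as the shutdown sentinel.
import asyncio


async def produce(queue):
    for i in range(2, 7):
        await queue.put(f"https://www.example.com/?page={i}")
    await queue.put(None)  # sentinel: tell the consumer to stop


async def consume(queue):
    while True:
        item = await queue.get()
        if item is None:
            break
        print("consumed", item)


async def main():
    queue = asyncio.Queue()
    await asyncio.gather(produce(queue), consume(queue))


asyncio.run(main())

My actual code with deques: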

import asyncio
from collections import deque

import httpx
from bs4 import BeautifulSoup

# domain_url (the site's base URL) is defined elsewhere in my code


async def gather_search_links(html_sources, detail_urls):
    while True:
        if not html_sources:
            await asyncio.sleep(0)
            continue

        data = html_sources.pop()
        if data is None:
            html_sources.appendleft(None)
            break
        data = BeautifulSoup(data, "html.parser")
        result = data.find_all("div", {"data-component": "search-result"})
        for record in result:
            atag = record.h2.a
            url = f'{domain_url}{atag.get("href")}'
            detail_urls.appendleft(url)
        print("appended data", len(detail_urls))
        await asyncio.sleep(0)


async def get_page_source(urls, html_sources):
    client = httpx.AsyncClient()
    while True:
        if not urls:
            await asyncio.sleep(0)
            continue

        url = urls.pop()
        print("url", url)
        if url is None:
            urls.appendleft(None)
            break

        response = await client.get(url)
        html_sources.appendleft(response.text)
        await asyncio.sleep(8)
    await client.aclose()
    html_sources.appendleft(None)


async def navigate(urls):
    for i in range(2, 7):
        url = f"https://www.example.com/?page={i}"
        urls.appendleft(url)
        await asyncio.sleep(0)
    nav_urls.appendleft(None)


loop = asyncio.get_event_loop()
nav_html = deque()
nav_urls = deque()
products_url = deque()

navigate_workers = [asyncio.ensure_future(navigate(nav_urls)) for _ in range(1)]
page_source_workers = [asyncio.ensure_future(get_page_source(nav_urls, nav_html)) for _ in range(2)]
product_urls_workers = [asyncio.ensure_future(gather_search_links(nav_html, products_url)) for _ in range(1)]
workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])

loop.run_until_complete(workers)
user3541631
  • It is not clear what the incorrect behavior is. Is it that "all the workers are closed before all urls are processed"? Reading the code, it seems more like the program continues to run even if there are no more urls to process. – amirouche Oct 23 '20 at 10:10
  • The first worker for "get_page_source" introduces a None into the deque. The "gather_search_links" worker sees that there is just a None in the deque and stops (break). The issue is that the second "get_page_source" worker didn't have time to push its data into the deque, and the "gather_search_links" worker is already closed, so the last item is not processed. – user3541631 Oct 25 '20 at 07:34
  • You pass `nav_urls` to `navigate` as the `urls` argument, but then within that function you hardcode `nav_urls.appendleft(None)`. The logic of that escapes me. – Booboo Oct 27 '20 at 11:28

1 Answer


I'm a bit of a newbie, so this could be wrong as can be, but I believe the issue is that all three of the functions, navigate(), gather_search_links(), and get_page_source(), are asynchronous tasks that can be completed in any order. However, your checks for empty deques and your use of appendleft to ensure None is the leftmost item in your deques look like they would appropriately prevent this. For all intents and purposes the code looks like it should run correctly.

I think the issue arises at this line:

workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])

According to this post, the asyncio.wait function does not run these tasks in the order they're written above; instead, it fires them as coroutines according to I/O. Again, your checks at the beginning of gather_search_links and get_page_source ensure that one function runs after the other, and thus this code should work if there is only a single worker for each function. If there are multiple workers for each function, I can see issues arising where None doesn't wind up being the leftmost item in your deques. Perhaps a print statement at the end of each function to show the contents of your deques would be useful in troubleshooting this.
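
For example, a throwaway helper like this (the helper itself is mine, the deque names are from your code) could be called just before each break and at the end of get_page_source:

# Throwaway debugging helper: print a deque's contents left-to-right so you
# can see where the None sentinel sits relative to any unprocessed items.
def dump(name, dq):
    print(f"{name}: {list(dq)}")

# e.g. inside gather_search_links, just before the break:
#     dump("html_sources", html_sources)
#     dump("detail_urls", detail_urls)
# and at the end of get_page_source:
#     dump("urls", urls)
#     dump("html_sources", html_sources)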

I guess my major question would be: why do these tasks asynchronously if you're going to write extra code because the steps must be completed synchronously? In order to get the HTML you must first have the URL. In order to scrape the HTML you must first have the HTML. What benefit does asyncio provide here? All three of these make more sense to me as synchronous tasks: get URL, get HTML, scrape HTML, in that order.

EDIT: It occurred to me that the main benefit of asynchronous code here is that you don't want to wait on each individual URL to respond synchronously when you fetch the HTML from them. What I would do in this situation is gather my URLs synchronously first, and then combine the get and scrape functions into a single asynchronous function, which would be your only asynchronous function. Then you don't need a sentinel, a check for a None value, or any of that extra code, and you get the full value of the asynchronous fetch. You could then store your scraped data in a list (or deque, or whatever) of futures. This would simplify your code and give you the fastest possible scrape time.

LAST EDIT: Here's my quick and dirty rewrite. I liked your code so I decided to do my own spin. I have no idea if it works; I'm not a Python person.

import asyncio
from collections import deque

import httpx
from bs4 import BeautifulSoup

# Get or build URLs from config
def navigate():
    urls = deque()
    for i in range(2, 7):
        url = f"https://www.example.com/?page={i}"
        urls.appendleft(url)
    return urls

# Asynchronously fetch and parse data for a single URL
async def fetchHTMLandParse(url):
    # Use the client as a context manager so the connection is closed cleanly
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
    data = BeautifulSoup(response.text, "html.parser")
    result = data.find_all("div", {"data-component": "search-result"})
    for record in result:
        atag = record.h2.a
        # domain_url was defined elsewhere
        product_url = f'{domain_url}{atag.get("href")}'
        products_urls.appendleft(product_url)


loop = asyncio.get_event_loop()
products_urls = deque()

nav_urls = navigate()
fetch_and_parse_workers = [asyncio.ensure_future(fetchHTMLandParse(url)) for url in nav_urls]
workers = asyncio.wait([*fetch_and_parse_workers])

loop.run_until_complete(workers)
TheFunk
  • I don't want to wait for each url. The example is simple, but the idea is that I want to manipulate the data and have database communication, and maybe the navigation can use multiple sources. If it is done synchronously, I need to wait for every step. It is like having multiple producer-consumers. – user3541631 Oct 29 '20 at 12:03
  • @user3541631 The thing is, you are doing each step synchronously above. You defined asynchronous functions, but then you're waiting on your URL to populate to fetch that page's HTML, and then you're waiting on that page's HTML to download before you parse that HTML. The example I put above just removes the need for your wait statements. You could still fetch your URLs from multiple sources asynchronously, but the function navigate is now blocking, removing your need to write a sentinel and other blocking code within a function that is supposed to be nonblocking. – TheFunk Oct 29 '20 at 12:29
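
As a rough illustration of that last point, here is a hypothetical sketch of a navigate that gathers URLs from several sources concurrently and then returns them all at once; get_urls_from_sitemap and get_urls_from_listing are made-up placeholders, not functions from the question or answer:

import asyncio

# Hypothetical placeholders (not from the question or answer): each pretends
# to fetch a list of URLs from a different source.
async def get_urls_from_sitemap():
    await asyncio.sleep(0)
    return ["https://www.example.com/?page=2", "https://www.example.com/?page=3"]

async def get_urls_from_listing():
    await asyncio.sleep(0)
    return ["https://www.example.com/?page=4"]

# Collect URLs from both sources concurrently, then return them all at once;
# the caller can treat this as a single blocking-style "navigate" step.
async def navigate():
    groups = await asyncio.gather(get_urls_from_sitemap(), get_urls_from_listing())
    return [url for group in groups for url in group]

urls = asyncio.run(navigate())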