
I need to create a server that accepts REST requests and returns the scraped data from the indicated site.

For example, a URL like this:

http://myip/scraper?url=www.exampe.com&token=0

I have to scrape a site built with JavaScript that detects whether it is opened by a real browser or a headless one.

The only alternatives are Selenium or pyppeteer, combined with a virtual display.

I currently use Selenium and FastAPI, but it is not a workable solution under a lot of requests: for each request Chrome is opened and closed, which delays the response considerably and consumes a lot of resources.

With async pyppeteer you can open multiple tabs at the same time in a single browser instance, reducing response times, but this would likely lead to other problems once the number of tabs grows.

I was thinking of creating a pool of browser instances across which to distribute the incoming requests, similar to what puppeteer-cluster does.

But so far I haven't been able to figure it out.
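
Roughly what I have in mind is something like this (only an untested sketch; BrowserPool and its methods are names I made up, not an existing API):

import asyncio
from pyppeteer import launch


class BrowserPool:
    # Pool of pyppeteer browser instances; each request borrows a browser,
    # opens a tab in it, and puts the browser back when it is done.
    def __init__(self, size=3):
        self._size = size
        self._queue = asyncio.Queue()

    async def start(self):
        for _ in range(self._size):
            browser = await launch(headless=False, args=['--no-sandbox'])
            await self._queue.put(browser)

    async def fetch(self, url):
        browser = await self._queue.get()       # wait for a free browser
        try:
            page = await browser.newPage()
            try:
                await page.goto(url)
                return await page.content()
            finally:
                await page.close()
        finally:
            await self._queue.put(browser)       # always give the browser back

    async def close(self):
        while not self._queue.empty():
            browser = await self._queue.get()
            await browser.close()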

This is the code I am currently trying for the Browser class:

import json
from pyppeteer import launch
from pyppeteer.errors import TimeoutError
from strings import keepa_storage


class Browser:
    async def __aenter__(self):
        # headless=False plus these flags reduces the chance of headless detection
        self._session = await launch(
            headless=False,
            autoClose=False,
            args=['--no-sandbox', '--disable-gpu', '--lang=it',
                  '--disable-blink-features=AutomationControlled'])
        return self

    async def __aexit__(self, *err):
        self._session = None

    async def fetch(self, url):
        page = await self._session.newPage()
        page_source = None
        try:
            # open a page on the target origin first so that localStorage can be populated
            await page.goto("https://example.com/404")

            for key in keepa_storage:
                await page.evaluate(
                    "window.localStorage.setItem('{}', {})".format(key, json.dumps(keepa_storage.get(key))))

            await page.goto(url)
            await page.waitForSelector('#tableElement')
            page_source = await page.content()
            
        except TimeoutError:
            print(f'Timeout for: {url}')
        finally:
            # always close the tab; note that returning in finally also swallows other exceptions
            await page.close()
            return page_source

And this code for the request:

async with Browser() as http:
    source = await asyncio.gather(
        http.fetch('https://example.com')
    )

But I have no idea how to reuse the same browser session across multiple server requests.
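
For example, could something like this work, creating the Browser once when the FastAPI app starts and reusing it in every request? (Again only an untested sketch; the /scraper handler and its parameters are just illustrative.)

from fastapi import FastAPI

app = FastAPI()
browser = Browser()  # the Browser class defined above


@app.on_event("startup")
async def open_browser():
    # launch one shared Chromium instance when the server starts
    await browser.__aenter__()


@app.on_event("shutdown")
async def close_browser():
    # __aexit__ above only drops the reference; the browser itself may still need closing
    await browser.__aexit__()


@app.get("/scraper")
async def scraper(url: str, token: str):
    # every request reuses the shared browser and just opens a new tab
    return {"source": await browser.fetch(url)}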

  • Just a few days ago, I encountered a similar problem. We resolved it by using a Manager-Worker formation, where the manager maintains a Queue of workers. Each worker is basically a chromedriver instance in headless mode. Once the manager receives a request, it dispatches a worker to consume the item. On successful completion it puts the worker back into the queue; otherwise it restarts the worker and puts it back into the queue. – Roy May 17 '21 at 13:45
  • The idea is theoretically not bad, but it opens only one link at a time for each chromedriver instance. I wanted to try opening several URLs per browser instance. Do you have an example of your work? – tecn603 May 17 '21 at 13:49
  • We can open multiple tabs in each chromedriver. I can't share the entire code, sorry about that. I will share the overall interface of the two classes in some time. – Roy May 17 '21 at 13:54
  • Thank you; in the meantime I will try to see how to do it. – tecn603 May 17 '21 at 13:56

1 Answer


While initialising the server, create a Manager object; as implemented, the manager automatically spawns all the workers it needs. In the API handler, invoke manager.assign(item). This gets an idle worker and assigns the item to it; if no worker is idle at that moment, the Queue behind manager._AVAILABLE_WORKERS makes the call wait until a worker becomes available. On a different thread, run an infinite loop that invokes manager.heartbeat() to make sure that the workers are not slacking off.

The comments in the skeleton below explain the purpose of each method and what it is supposed to do, and a rough wiring sketch follows the skeleton. That should be enough to get you started. Feel free to let me know in case further clarification is required.

from queue import Queue

class Worker:
    ###
    # class to define behavior and parameters of workers
    ###

    def __init__(self, base_url):
        ###
        # Initialises a worker
        # STEP 1. Create one worker with given inputs
        # STEP 2. Mark the worker busy
        # STEP 3. Get ready for item consumption with initialisation/login process done
        # STEP 4. Mark the worker available and active
        ###
        raise NotImplementedError()

    def process_item(self, **item):
        ###
        # Worker processes the given item and returns data to manager
        # Step 1. worker marks himself busy
        # Step 2. worker processes the item. Handle Errors here
        # Step 3. worker marks himself available
        # Step 4. Return the data scraped
        ###
        raise NotImplementedError()

class Manager:
    ###
    # class for manager who supervises all the workers and assigns work to them
    ###

    def __init__(self):
        self._WORKERS = set()  # set container to hold all the workers' details
        self._AVAILABLE_WORKERS = Queue(maxsize=10)  # queue container to hold available workers
        # create all the worker we want and add them to self._WORKERS and self._AVAILABLE_WORKERS

    def assign(self, item):
        ###
        # Assigns an item to a worker to be processed and once processed returns data to the server
        # STEP 1. remove worker from available pool
        # STEP 2. assign item to worker
        # STEP 3A. if item is successfully processed, put the worker back to available pool
        # STEP 3B. if error occurred during item processing, try to reset the worker and put the worker back to
        # available pool
        ###
        raise NotImplementedError()

    def heartbeat(self):
        ###
        # process to check that all the workers are active and accounted for, at a regular interval.
        # if a worker is available but not in the pool, add it to the pool after checking that it's not busy
        # if a worker is not active, reset the worker and add it back to the pool
        ###
        raise NotImplementedError()
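
A rough wiring sketch of how this could plug into the server (the handler name, the item shape, and the heartbeat interval are placeholders to adapt to your setup):

import threading
import time

manager = Manager()  # spawns and enqueues all the workers during __init__


def heartbeat_loop(interval=30):
    # runs on its own thread; periodically checks that the workers are alive
    while True:
        manager.heartbeat()
        time.sleep(interval)


threading.Thread(target=heartbeat_loop, daemon=True).start()


# inside your API handler (e.g. the /scraper endpoint):
def handle_request(url, token):
    # blocks until an idle worker is available, then returns the scraped data
    return manager.assign({'url': url, 'token': token})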