I need to create a server that accepts REST requests and responds with data scraped from the site indicated in the request.
For example, a URL like this:
http://myip/scraper?url=www.example.com&token=0
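Concretely, the endpoint shape I have in mind is roughly this (a minimal sketch; the path and parameter names come from the example URL above, and the token check is omitted):

from fastapi import FastAPI

app = FastAPI()

@app.get('/scraper')
async def scraper(url: str, token: str):
    # token validation and the actual scraping would go here
    ...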
I have to scrape a site built in JavaScript that detects whether it is opened by a real or a headless browser.
The only viable alternatives are Selenium or pyppeteer, combined with a virtual display.
I currently use Selenium with FastAPI, but it is not a workable solution under many requests: Chrome is opened and closed for every single request, which delays the response a lot and consumes a lot of resources.
With async pyppeteer you can open multiple tabs at the same time on the same browser instance, reducing response times, but this would likely lead to other problems once the number of tabs grows.
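To illustrate the kind of concurrency I mean (a rough sketch with a hypothetical fetch_one helper, not my real code):

import asyncio
from pyppeteer import launch

async def fetch_one(browser, url):
    # each call opens its own tab in the shared browser
    page = await browser.newPage()
    try:
        await page.goto(url)
        return await page.content()
    finally:
        await page.close()

async def fetch_all(urls):
    browser = await launch(headless=True)
    try:
        # all tabs are in flight at once on the same instance
        return await asyncio.gather(*(fetch_one(browser, u) for u in urls))
    finally:
        await browser.close()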
I was thinking of creating a pool of browser instances and dividing the incoming requests among them, the way puppeteer-cluster does, but so far I haven't been able to figure out how to do it.
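The direction I picture is something like this (only a sketch of the idea, with an assumed pool size of 3; I haven't managed to turn it into working code):

import asyncio
from pyppeteer import launch

class BrowserPool:
    def __init__(self, size=3):
        self._size = size
        self._queue = asyncio.Queue()

    async def start(self):
        # launch a fixed set of browser instances up front
        for _ in range(self._size):
            self._queue.put_nowait(await launch(headless=True))

    async def fetch(self, url):
        # wait until one of the pooled browsers is free
        browser = await self._queue.get()
        try:
            page = await browser.newPage()
            try:
                await page.goto(url)
                return await page.content()
            finally:
                await page.close()
        finally:
            # hand the instance back to the pool
            self._queue.put_nowait(browser)

    async def stop(self):
        while not self._queue.empty():
            await (await self._queue.get()).close()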
This is the code I am currently trying for the Browser class:
import json

from pyppeteer import launch
from pyppeteer.errors import TimeoutError  # pyppeteer's timeout, not the builtin

from strings import keepa_storage


class Browser:
    async def __aenter__(self):
        # one Chrome instance per context; autoClose=False keeps it alive
        # after the launching coroutine finishes
        self._session = await launch(
            headless=False,
            args=['--no-sandbox', '--disable-gpu', '--lang=it',
                  '--disable-blink-features=AutomationControlled'],
            autoClose=False)
        return self

    async def __aexit__(self, *err):
        # close Chrome when the context exits instead of leaking it
        await self._session.close()
        self._session = None

    async def fetch(self, url):
        # each fetch gets its own tab in the shared browser
        page = await self._session.newPage()
        page_source = None
        try:
            # open a page on the target origin first so localStorage
            # can be seeded before the real navigation
            await page.goto("https://example.com/404")
            for key in keepa_storage:
                await page.evaluate(
                    "window.localStorage.setItem('{}', {})".format(
                        key, json.dumps(keepa_storage.get(key))))
            await page.goto(url)
            await page.waitForSelector('#tableElement')
            page_source = await page.content()
        except TimeoutError:
            print(f'Timeout for: {url}')
        finally:
            await page.close()
        return page_source
And this code for the request:
import asyncio

async def main():
    async with Browser() as http:
        # gather returns a list with one entry per fetched URL
        source = await asyncio.gather(
            http.fetch('https://example.com')
        )
But I have no idea how to reuse the same browser session across multiple server requests.
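To make the question concrete, this is roughly what I imagine, but I don't know whether it is correct or safe (a sketch only, assuming FastAPI's startup/shutdown events and the Browser class above):

from fastapi import FastAPI

app = FastAPI()
browser = Browser()  # the class from above, shared by all requests

@app.on_event('startup')
async def startup():
    await browser.__aenter__()  # launch Chrome once for the app's lifetime

@app.on_event('shutdown')
async def shutdown():
    await browser.__aexit__()

@app.get('/scraper')
async def scraper(url: str, token: str):
    # every request reuses the same browser, each fetch in its own tab
    return {'source': await browser.fetch(url)}

If this single shared instance is viable at all, I would then swap it for the pool sketched above. Is this the right approach?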