I'm working on a project and I've discovered that the data I want is stored as auto-generated PDFs on the web (not indexed by search engines). The URLs follow a consistent pattern that looks something like https://www.website.com/document/ followed by a four-digit number, then another /, then a ten-digit number followed by .pdf. I'd like to grab every PDF that exists at these URLs, OCR them with tesseract, parse the text with PDFPlumber, and then store the data in a Pandas DF/SQL DB for future ML/NLP work. (Not every number combo has a PDF.)
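For context, this is roughly what I had in mind per document. It's just a sketch: the URL and the two IDs are placeholders, and I'm assuming requests, pdfplumber, pytesseract, pdf2image, and pandas are all installed.

```python
import io

import pandas as pd
import pdfplumber
import pytesseract
import requests
from pdf2image import convert_from_bytes


def fetch_and_extract(doc_id, page_id):
    """Download one PDF and pull its text, OCRing pages with no text layer."""
    url = f"https://www.website.com/document/{doc_id}/{page_id}.pdf"  # placeholder pattern
    resp = requests.get(url, timeout=10)
    if resp.status_code != 200:
        return None  # this number combo has no PDF behind it

    text_parts = []
    with pdfplumber.open(io.BytesIO(resp.content)) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text()
            if text:
                # PDF already has a text layer, no OCR needed
                text_parts.append(text)
            else:
                # scanned page: rasterize just this page and OCR it with tesseract
                image = convert_from_bytes(resp.content,
                                           first_page=i + 1, last_page=i + 1)[0]
                text_parts.append(pytesseract.image_to_string(image))

    return {"doc_id": doc_id, "page_id": page_id, "text": "\n".join(text_parts)}


# Collect whatever exists into a DataFrame for later NLP work
# (the IDs here are made up, just to show the shape of the result).
rows = [r for r in (fetch_and_extract("0001", "0000000001"),) if r]
df = pd.DataFrame(rows)
```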
My first thought was nested for loops + selenium, but with roughly 10^4 * 10^10 ≈ 10^14 possible URLs at ~2 seconds each, that works out to about 6.3 million years, so it's off the table. Then I thought about parallelizing the process with something like arsenic, but even with 128 threads on a server it would still take about 50,000 years to get through them all. That's before the storage problem: even if only 1 in 100,000 combinations points to a real PDF, that's on the order of a billion files, around 50 TB. I could OCR, parse, and delete as I go, but I suspect that would take way longer than doing it after the fact.
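This is the brute-force enumeration I ruled out, written with plain requests.head instead of selenium just to make the counting explicit (the base URL and the exact digit ranges are guesses):

```python
import requests

BASE = "https://www.website.com/document/{:04d}/{:010d}.pdf"  # placeholder pattern


def brute_force():
    """Try every 4-digit / 10-digit combination -- the approach that doesn't scale."""
    # 10**4 * 10**10 = 10**14 candidate URLs.
    # At ~2 s per request that's ~2e14 s, roughly 6.3 million years serially,
    # and still ~50,000 years split across 128 workers.
    for doc in range(10**4):
        for page in range(10**10):
            url = BASE.format(doc, page)
            resp = requests.head(url, timeout=5)  # only checking existence
            if resp.status_code == 200:
                yield url
```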
Is this whole process hopeless? Or is there something I can try to tackle this problem in a semi-reasonable amount of time?