Crawling issue with loading page using Python (wait up to 5 seconds)

Question

I am trying to crawl the webpage https://sec.report/, which seems to be protected by a certain server configuration. (I need the data for my master's thesis.)

I have a list of company names for which I would like to get certain identifiers (CIK) from the above website:
Landauer Inc --> 0000825410.
Starwood Waypoint Homes --> 0001579471.
Supreme Industries Inc --> 0000350846.
[and 2,000 more ...]

Example: Searching for the first entry in the list above (Landauer Inc), I can get the CIK using the following link: https://sec.report/CIK/Search/Landauer%20Inc. The generic link is https://sec.report/CIK/Search/{company_name}, with the company name URL-encoded.
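For illustration, here is a minimal sketch of how the generic link could be built for each company name. The names `BASE_URL` and `build_search_url` are my own and not part of any library; the only assumption about the site is the URL pattern given above.

```python
from urllib.parse import quote

# URL pattern taken from the example above.
BASE_URL = "https://sec.report/CIK/Search/"


def build_search_url(company_name: str) -> str:
    """Return the search URL for one company name (hypothetical helper)."""
    # quote() percent-encodes spaces as %20; safe="" also encodes "/" so a
    # slash inside a company name cannot change the URL path.
    return BASE_URL + quote(company_name, safe="")


names = ["Landauer Inc", "Starwood Waypoint Homes", "Supreme Industries Inc"]
urls = [build_search_url(name) for name in names]
print(urls[0])  # https://sec.report/CIK/Search/Landauer%20Inc
```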

Problem: When I send a simple request (Python) to the above URL, I get an HTTP 200 response. Yet the response body is only a loading page saying: Please wait up to 5 seconds.... Please see the response here: Loading page when request is sent.
I assume the website is protected by Cloudflare, based on https://checkforcloudflare.selesti.com/?q=https://sec.report/

Try-outs: I have already tried to crawl the page using Python with:
(1) Tor proxies with full, rotating request headers.
(2) Selenium, including Cloudflare packages/extensions.
(3) A simple Scrapy spider (I have never used Scrapy before, so I may have missed a working solution).

Does anyone have an idea how I could bypass the protection to crawl the necessary data?
Thanks a lot in advance!

I just gave it a try with Postman to see the response, and I didn't get the "wait 5 seconds" issue. Worst case, you could automate Postman to accomplish your goals. – Tomek, Jan 03 '21 at 23:47

score 0 · Answer 1 · answered Jan 05 '21 at 07:33

You may want to take a look at Selenium's implicit wait:

driver.implicitly_wait(10) # seconds

With that line of code, every time you try to select an element on the page, Selenium will keep trying to find it for up to 10 seconds (or longer if you set a higher value) and raise a NoSuchElementException if it is still not found.
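Note that the implicit wait only applies to element lookups, so it will not by itself wait for the loading page to go away. A minimal sketch of combining it with an explicit wait is below. The helper names `still_waiting` and `load_page` are hypothetical, the message text is the one quoted in the question, and I am assuming Chrome with a matching driver is available. I have not verified that this gets past the protection on this particular site.

```python
# Text of the loading page, as quoted in the question.
WAIT_MESSAGE = "Please wait up to 5 seconds"


def still_waiting(page_source: str) -> bool:
    """Return True while the loading page is still being shown."""
    return WAIT_MESSAGE in page_source


def load_page(url: str, timeout: int = 10) -> str:
    """Open url and return the HTML once the loading page has disappeared."""
    # Imported here so still_waiting() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are installed
    try:
        driver.implicitly_wait(timeout)  # affects element lookups only
        driver.get(url)
        # Explicit wait: poll the page source until the message is gone.
        # Raises selenium.common.exceptions.TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            lambda d: not still_waiting(d.page_source)
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on a timeout
```

You would call it as `html = load_page("https://sec.report/CIK/Search/Landauer%20Inc")` and then parse the CIK out of `html`.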
