
I am trying to crawl the website https://sec.report/, which appears to be protected by some kind of server-side configuration. (I need the data for my master's thesis.)

I have a list of company names for which I would like to get certain identifiers (CIK) from the above website:
Landauer Inc --> 0000825410.
Starwood Waypoint Homes --> 0001579471.
Supreme Industries Inc --> 0000350846.
[and 2,000 more ...]

Example: Searching for the first entry in the list above (Landauer Inc), I can get the CIK using the following link: https://sec.report/CIK/Search/Landauer%20Inc. The generic link is https://sec.report/CIK/Search/{company_name}, with the company name URL-encoded.
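For illustration, here is a minimal sketch of how the search URLs could be built from the company names. The helper name `build_search_url` is my own (hypothetical); it only assumes the generic link pattern above and percent-encodes the name with the standard library.

```python
from urllib.parse import quote

# Generic search link from the question: https://sec.report/CIK/Search/{company_name}
BASE_URL = "https://sec.report/CIK/Search/"


def build_search_url(company_name: str) -> str:
    """Return the search URL for one company name (hypothetical helper)."""
    # safe="" percent-encodes every reserved character, including "/" and spaces.
    return BASE_URL + quote(company_name, safe="")


companies = ["Landauer Inc", "Starwood Waypoint Homes", "Supreme Industries Inc"]
urls = [build_search_url(name) for name in companies]

for url in urls:
    print(url)
```

This only constructs the URLs; it does not send any request, so it does not touch the protection issue described below.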

Problem: When I send a simple request (Python) to the above URL, I get an HTTP 200 response. However, the body is only an interstitial page saying: Please wait up to 5 seconds.... See the response here: Loading page when request is sent.
I assume the website is protected by Cloudflare, based on https://checkforcloudflare.selesti.com/?q=https://sec.report/

Attempts: I have already tried to crawl the page using Python with:
(1) Tor proxies with full, rotating request headers.
(2) Selenium, including Cloudflare packages/extensions.
(3) A simple Scrapy spider (I have never used Scrapy before, so I may have missed a working solution).

Does anyone have an idea how I could get past the protection to crawl the necessary data?
Thanks a lot in advance!

lkick

1 Answer


You may want to take a look at Selenium's implicit wait:

driver.implicitly_wait(10) # seconds

With that line of code, every time you try to select an element on the page, Selenium will keep looking for it for up to 10 seconds (or longer, if you set a higher value) and raise an error if it is still not found.
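As a rough sketch of how this could be used, assuming the real content eventually replaces the "Please wait" page in the same browser session: the function name `fetch_after_wait` and the CSS selector passed to it are hypothetical, since I do not know the markup of the results page, and whether this is enough to get past the protection is not guaranteed.

```python
def fetch_after_wait(driver, url, css_selector, wait_seconds=10):
    """Load url and return the page source once css_selector is found.

    driver is expected to be a Selenium WebDriver instance.
    css_selector is an assumption: pick an element that only exists
    on the real results page, not on the loading page.
    """
    # The implicit wait applies to every later find_element call on this driver.
    driver.implicitly_wait(wait_seconds)
    driver.get(url)
    # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
    # Selenium polls for up to wait_seconds and raises
    # NoSuchElementException if the element never appears.
    driver.find_element("css selector", css_selector)
    return driver.page_source
```

You would call it with a real driver, for example one created via `webdriver.Firefox()` after `from selenium import webdriver`, and with a selector matching an element you expect on the search results page.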

rafalou38