
I ran into a Cloudflare issue when I tried to scrape a website.

I have this code:

import cloudscraper

url = "https://author.today"
scraper = cloudscraper.create_scraper()
print(scraper.post(url).status_code)

Running it raises:

cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 challenge, This feature is not available in the opensource (free) version.

I searched for a workaround but couldn't find any solution. If you visit the website in a browser, you see:

Checking your browser before accessing author.today.

Is there any way to bypass Cloudflare in my case?

Nickolas
    The exception message implies a solution. – Klaus D. Jan 06 '21 at 23:14
  • `not available in the opensource (free) version` - so pay for this. – furas Jan 07 '21 at 00:32
    There is apparently "no paid version". However, the docs state: ```Cloudflare modifies their anti-bot protection page occasionally, So far it has changed maybe once per year on average. If you notice that the anti-bot page has changed, or if this module suddenly stops working, please create a GitHub issue so that I can update the code accordingly.```. It suddenly stopped working for me too, so I assume they changed strategy. – Bastien Bastien Jan 07 '21 at 15:27
    Interestingly though, even when I copy the chrome request and resend it (with all cookies) from curl, using the same IP, it doesn't seem to fool CloudFlare. I wonder why that is and how would cloudflare differentiate my browser from cURL, when they both make the same request. (nb, that method of copying the request headers, used to work... not anymore though...) – Bastien Bastien Jan 07 '21 at 15:30
  • The exception indeed contains a hint. But I couldn't find any non-free version. – Nickolas Jan 08 '21 at 00:24
  • @Nickolas, have you found any solution? – shawnngtq Mar 20 '21 at 15:08
  • Seems they made fun of us. – Nabi K.A.Z. May 26 '21 at 09:59
  • I'm scraping 670 pages; the code works well until page 100 and then throws this exception. Did any of you guys find any solution or an alternate method? @shawnngtq Nabi K.A.Z. nickolas – Madhur Yadav Jul 04 '21 at 01:50
  • No, I haven't found any solution yet. If somebody finds one, please let me know @MadhurYadav – Nickolas Jul 05 '21 at 09:28
  • @MadhurYadav In your case, maybe you could just scrape 100 pages, wait 10, 20, 30 (who knows?) minutes or so, then scrape another 100 pages, etc. By the way, there is no paid version of cloudscraper; it's just really hard to keep up with Cloudflare's strategies. – Tommy A. Nov 03 '21 at 23:22
  • @BastienBastien they do, among other things, SSL handshake fingerprinting. And Chrome uses BoringSSL as its TLS library. – Paolo Feb 21 '22 at 14:12
  • @Paolo it seems that the modern viable solution is now to use Selenium, just like FlareSolverr does: https://github.com/FlareSolverr/FlareSolverr – Bastien Bastien Feb 21 '22 at 14:34
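
For reference, the FlareSolverr mentioned in the last comment runs as a local service that drives a real browser and exposes a small JSON API. A rough sketch of calling it from Python, assuming a default install listening on port 8191 and the documented request.get command (check the project README for the exact payload and response format):

import requests

payload = {
    "cmd": "request.get",          # ask FlareSolverr to fetch the page for us
    "url": "https://author.today",
    "maxTimeout": 60000,           # give the challenge up to 60 s to resolve
}
resp = requests.post("http://localhost:8191/v1", json=payload)
data = resp.json()
print(data.get("status"))                       # "ok" when the challenge was solved
print(data.get("solution", {}).get("status"))   # HTTP status of the target page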

6 Answers


Install httpx

pip3 install httpx[http2]

Define an HTTP/2 client

import httpx

client = httpx.Client(http2=True)

Make request

response = client.get("https://author.today")
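
To confirm that HTTP/2 was actually negotiated, a quick sanity check using attributes of the httpx response object:

print(response.http_version)   # expected "HTTP/2"
print(response.status_code)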

Cheers!

Zorome

Although it does not seem to work for this site, sometimes adding some parameters when initializing the scraper helps:

import cloudscraper

url = "https://author.today"
scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'android',
        'desktop': False
    }
)
print(scraper.post(url).status_code)
dcts

I'd try to create a Playwright scraper that mimics a real user; this works for me most of the time, you just need to find the right settings (they can vary from website to website). Otherwise, if the website has a native app, try to figure out how the app behaves and then mimic it.
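
A minimal sketch of such a Playwright scraper, assuming the sync API; the launch and context settings (headed mode, locale, viewport) are only examples of the knobs to tune per site, not a guaranteed bypass:

from playwright.sync_api import sync_playwright

url = "https://author.today"

with sync_playwright() as p:
    # a headed, ordinary-looking Chromium context tends to raise fewer flags than default headless
    browser = p.chromium.launch(headless=False)
    context = browser.new_context(
        locale="en-US",
        viewport={"width": 1366, "height": 768},
    )
    page = context.new_page()
    page.goto(url, wait_until="networkidle")
    page.wait_for_timeout(5000)  # give a possible Cloudflare interstitial time to resolve
    print(page.title())
    html = page.content()
    browser.close()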


I can suggest the following workflow to "try" to avoid Cloudflare WAF/bot mitigation:

  • don't cycle user agents, proxies or weird tunnels to surf
  • don't use fixed IP addresses; prefer residential-style lines like xDSL, home links and 4G/LTE
  • try to appear as a mobile device instead of a desktop/tablet
  • try to reproduce realistic pointer movements, i.e. record your mouse moves and replay them 1:1 while scraping (yes, you need JS enabled and a headless browser able to pass as a "common" one; see the sketch at the end of this answer)
  • don't cycle across different Cloudflare-protected entities, otherwise the scraping IP will be greylisted in a minute (i.e. build your own list of targets to stay away from; keep touching such entities and you will end up on the CF blacklist for sure)
  • try to reproduce real-life navigation in all aspects, including errors, waits and more
  • check the IP you used against popular blacklists after every scrape, otherwise nasty errors will appear shortly (CrowdSec is a good starting point)
  • the usual scrape poses as Googlebot; a single regex WAF rule on Cloudflare will block 99.99% of those attempts, so avoid faking Google and try to be LESS evil instead (e.g. ask webmasters for APIs or a data export, if any).

Source: I have been using Cloudflare with hundreds of domains and thousands of records (Enterprise plan) since the beginning of the company.

That way you will get closer to the goal (and you will help them increase the overall security).
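
A sketch of the pointer-movement replay mentioned in the list above, using Playwright purely as an example driver; the recorded coordinates are made up and would come from your own capture of a real session:

from playwright.sync_api import sync_playwright

# hypothetical capture of a real user's mouse path: (x, y, pause in milliseconds)
recorded_path = [(120, 300, 80), (260, 340, 120), (420, 380, 90), (640, 410, 150)]

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://author.today")
    for x, y, pause in recorded_path:
        page.mouse.move(x, y, steps=15)  # steps makes the move gradual instead of a single jump
        page.wait_for_timeout(pause)
    print(page.title())
    browser.close()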

fab23
import cfscrape
from fake_useragent import UserAgent

ua = UserAgent()

s = cfscrape.create_scraper()

# send the request with a random User-Agent header (note the header name must be "User-Agent")
k = s.post("https://author.today", headers={"User-Agent": ua.random})
print(k)  # prints the Response object, e.g. <Response [200]>
Hello

I used this line: scraper = cloudscraper.create_scraper(browser={'browser': 'chrome','platform': 'windows','mobile': False})

and then used the httpx package after that, wrapping the remaining code in with httpx.Client() as s: // Remaining Code

And I was able to bypass the error cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 challenge, This feature is not available in the opensource (free) version.
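
One plausible reading of this answer (purely an assumption on my part, since the remaining code is not shown) is that cloudscraper is only used to obtain browser-like headers and cookies, which are then reused by the httpx client:

import cloudscraper
import httpx

scraper = cloudscraper.create_scraper(
    browser={'browser': 'chrome', 'platform': 'windows', 'mobile': False}
)
# assumption: reuse the scraper's Chrome-like headers and any collected cookies in httpx
with httpx.Client(headers=dict(scraper.headers), cookies=scraper.cookies) as s:
    r = s.get("https://author.today")
    print(r.status_code)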

user2284144
    "and then used httpx package after that with httpx.Client() as s: //Remaining Code" Could you please elaborate on what exactly did you do with it and what is the "remaining code"? As is, this answer is unusable. – Kryomaani Jan 08 '23 at 17:45