
My problem is opening sites behind Cloudflare from the command line.
I do not mean the case where a challenge is shown, and I am not trying to solve a challenge.

Consider this site as an example: https://pegaxy.io
When it is opened for the first time in any freshly installed web browser, it loads without problems and status code 200 is received.

(Screenshot: site opened in a web browser)

But when I click Copy as cURL and run the command in the terminal, I get a 403 error.

(Screenshot: site opened in the terminal)

cURL command:

curl 'https://pegaxy.io/' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  -H 'Connection: keep-alive' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Sec-Fetch-Dest: document' \
  -H 'Sec-Fetch-Mode: navigate' \
  -H 'Sec-Fetch-Site: none' \
  -H 'Sec-Fetch-User: ?1' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  --compressed --verbose

Log:

$ curl 'https://pegaxy.io/' \
>   -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0' \
>   -H 'Upgrade-Insecure-Requests: 1' \
>   -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
>   -H 'Accept-Language: en-US,en;q=0.5' \
>   -H 'Connection: keep-alive' \
>   -H 'Upgrade-Insecure-Requests: 1' \
>   -H 'Sec-Fetch-Dest: document' \
>   -H 'Sec-Fetch-Mode: navigate' \
>   -H 'Sec-Fetch-Site: none' \
>   -H 'Sec-Fetch-User: ?1' \
>   -H 'Pragma: no-cache' \
>   -H 'Cache-Control: no-cache' \
>   --compressed --verbose
*   Trying 172.67.10.157:443...
* Connected to pegaxy.io (172.67.10.157) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: C:/Program Files/Git/mingw64/ssl/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=*.pegaxy.io
*  start date: Mar  3 05:22:24 2022 GMT
*  expire date: Jun  1 05:22:23 2022 GMT
*  subjectAltName: host "pegaxy.io" matched cert's "pegaxy.io"
*  issuer: C=US; O=Let's Encrypt; CN=E1
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x671500)
> GET / HTTP/2
> Host: pegaxy.io
> accept-encoding: deflate, gzip
> user-agent: Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0
> accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
> accept-language: en-US,en;q=0.5
> connection: keep-alive
> upgrade-insecure-requests: 1
> sec-fetch-dest: document
> sec-fetch-mode: navigate
> sec-fetch-site: none
> sec-fetch-user: ?1
> pragma: no-cache
> cache-control: no-cache
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
< HTTP/2 403
< date: Fri, 18 Mar 2022 13:03:04 GMT
< content-type: text/html; charset=UTF-8
< cache-control: max-age=15
< expires: Fri, 18 Mar 2022 13:03:19 GMT
< x-frame-options: SAMEORIGIN
< expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
< set-cookie: __cf_bm=rlU7vb3eTQzw02vcpzOo6gweMMadJXkNxsft3MqPLSY-1647608584-0-AXk3yx3EOmlDZ+tIGWB3S+1ud6hWmykBwT7IwKtO+e+eCdY36JjTgyM3SkdIyBeWvtZphzvnBZLCVE4R6YogbxI=; path=/; expires=Fri, 18-Mar-22 13:33:04 GMT; domain=.pegaxy.io; HttpOnly; Secure; SameSite=None
< vary: Accept-Encoding
< server: cloudflare
< cf-ray: 6ede2a920d7392c5-FRA
< content-encoding: gzip
<
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<title>Attention Required! | Cloudflare</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1" />
<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css" media="screen,projection" />
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-->
<style type="text/css">body{margin:0;padding:0}</style>


<!--[if gte IE 10]><!-->
<script>
  if (!navigator.cookieEnabled) {
    window.addEventListener('DOMContentLoaded', function () {
      var cookieEl = document.getElementById('cookie-alert');
      cookieEl.style.display = 'block';
    })
  }
</script>
<!--<![endif]-->


</head>
<body>
  <div id="cf-wrapper">
    <div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>
    <div id="cf-error-details" class="cf-error-details-wrapper">
      <div class="cf-wrapper cf-header cf-error-overview">
        <h1 data-translate="block_headline">Sorry, you have been blocked</h1>
        <h2 class="cf-subheadline"><span data-translate="unable_to_access">You are unable to access</span> pegaxy.io</h2>
      </div><!-- /.header -->

      <div class="cf-section cf-highlight">
        <div class="cf-wrapper">
          <div class="cf-screenshot-container cf-screenshot-full">

              <span class="cf-no-screenshot error"></span>

          </div>
        </div>
      </div><!-- /.captcha-container -->

      <div class="cf-section cf-wrapper">
        <div class="cf-columns two">
          <div class="cf-column">
            <h2 data-translate="blocked_why_headline">Why have I been blocked?</h2>

            <p data-translate="blocked_why_detail">This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting
a certain word or phrase, a SQL command or malformed data.</p>
          </div>

          <div class="cf-column">
            <h2 data-translate="blocked_resolve_headline">What can I do to resolve this?</h2>

            <p data-translate="blocked_resolve_detail">You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.</p>
          </div>
        </div>
      </div><!-- /.section -->

      <div class="cf-error-footer cf-wrapper w-240 lg:w-full py-10 sm:py-4 sm:px-8 mx-auto text-center sm:text-left border-solid border-0 border-t border-gray-300">
  <p class="text-13">
    <span class="cf-footer-item sm:block sm:mb-1">Cloudflare Ray ID: <strong class="font-semibold">6ede2a920d7392c5</strong></span>
    <span class="cf-footer-separator sm:hidden">&bull;</span>
    <span class="cf-footer-item sm:block sm:mb-1"><span>Your IP</span>: 46.62.217.20</span>
    <span class="cf-footer-separator sm:hidden">&bull;</span>
    <span class="cf-footer-item sm:block sm:mb-1"><span>Performance &amp; security by</span> <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing" id="brand_link" target="_blank">Cloudflare</a></span>

  </p>
</div><!-- /.error-footer -->


    </div><!-- /#cf-error-details -->
  </div><!-- /#cf-wrapper -->

  <script type="text/javascript">
  window._cf_translation = {};


</script>

</body>
</html>
* Connection #0 to host pegaxy.io left intact

Tested on Windows and Linux.

$ curl -V
curl 7.70.0 (x86_64-w64-mingw32) libcurl/7.70.0 OpenSSL/1.1.1g (Schannel) zlib/1.2.11 libidn2/2.3.0 libssh2/1.9.0 nghttp2/1.40.0
Release-Date: 2020-04-29
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp
Features: AsynchDNS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz Metalink MultiSSL NTLM SPNEGO SSL SSPI TLS-SRP

Note that I am talking about exactly the first request, so this cannot be related to cookies or to a piece of JavaScript that checks state in the browser.

I am also aware that Cloudflare is sensitive to the IP address and the request rate, but that applies when requests are frequent and look suspicious, and it eventually leads to a challenge being displayed. In my tests, the same IP has no problem in the browser, and only a single first request is made.

There is no problem with the Chrome, Firefox, Edge, Brave, and Tor web browsers.
But there is a problem with curl, wget, nghttp, and lynx on the command line. I also tested several Node.js packages, and they had the same problem.

Question: When all the conditions are the same, how does Cloudflare detect that the request does not come from a browser, and how can a browser be simulated, or the check bypassed, on the command line without using a web browser?

Keep in mind that I know about Selenium, FlareSolverr, pupflare, etc. I do not want to use a browser, because it renders the page and slows down the operation.

Things I tried that did not help:

  • I thought the problem might be the lack of HTTP/2 server push support in command-line curl, so I wrote the request in PHP and implemented server push there, but the problem was not solved.
  • I thought the problem was with the certificate, so I downloaded the certificate from the browser, converted it to a PEM file, and used it with curl, but the problem was not solved.

I just want to get the same 200 response code in the terminal on the first request, given that I get 200 on the first request in the web browser!

Machavity
Nabi K.A.Z.
  • Cloudflare doesn't publish exactly what their Browser Integrity Check does, because if they did it would be easy to figure out how to bypass. The owner of the website you are trying to scrape has specifically enabled Browser Integrity Check in their Cloudflare firewall to prevent people from scraping their website, just as you are trying to do. – Mark B Mar 18 '22 at 19:41
  • @MarkB I'm not looking for a method officially released by Cloudflare. However, Cloudflare uses the same conventional technologies as everyone else; no new technology has come from another planet. The way browsers send data on the web, the TCP/IP protocols, etc. are all standardized, and Cloudflare uses the same standards. So I think it's important to understand what Cloudflare is sensitive to. There is no secret; you just have to review all the technical aspects. – Nabi K.A.Z. Mar 18 '22 at 19:50

2 Answers


Cloudflare uses various techniques to determine whether the user agent is a real browser. The site owner can also configure, via the Cloudflare platform, the level of risk they are willing to allow.

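Here are a few of the techniques I know of that Cloudflare uses:

  1. TLS fingerprinting. This is one of the most prominent techniques used by Cloudflare. The TLS ClientHello (cipher suites, extensions, and their order) identifies the client's TLS library independently of the HTTP headers, so curl looks different from a browser even when the headers are identical. This is also the reason why tools like naiveproxy are popular. Link: https://github.com/klzgrad/naiveproxy

  2. Cookies. Cloudflare has used some cf_-related cookies to help distinguish real users from other clients.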

These are only a few of the techniques. Cloudflare has many more.

This issue is not limited to Cloudflare. The Great Firewall of China is also notorious for using the same kind of techniques to classify traffic.
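
To make the TLS fingerprinting point more concrete, here is a minimal sketch of a JA3-style fingerprint, a publicly documented scheme that hashes the fields of the TLS ClientHello. Whether Cloudflare uses exactly this scheme is an assumption; the two sample ClientHello objects (`browserLike`, `cliLike`) and their values are made up for illustration and were not captured from a real browser or from curl. The point is only that two clients sending identical HTTP headers can still produce different fingerprints.

```javascript
// Illustrative JA3-style fingerprint. Field values below are hypothetical.
const crypto = require('node:crypto')

// GREASE values (RFC 8701), e.g. 0x0A0A or 0x1A1A, are random placeholders
// that some clients insert; JA3 ignores them.
const isGrease = (v) => (v & 0x0f0f) === 0x0a0a && (v >> 8) === (v & 0xff)

// JA3 string: version,ciphers,extensions,curves,pointFormats
// Values inside a field are joined with '-', fields are joined with ','.
const ja3String = ({ version, ciphers, extensions, curves, pointFormats }) =>
  [[version], ciphers, extensions, curves, pointFormats]
    .map((field) => field.filter((v) => !isGrease(v)).join('-'))
    .join(',')

// The fingerprint is the MD5 hash of the JA3 string.
const ja3Hash = (hello) =>
  crypto.createHash('md5').update(ja3String(hello)).digest('hex')

// Hypothetical ClientHello of a browser: GREASE entries, more extensions.
const browserLike = {
  version: 771, // TLS 1.2 on the wire (0x0303)
  ciphers: [2570, 4865, 4866, 4867],
  extensions: [2570, 0, 23, 65281, 10, 11],
  curves: [2570, 29, 23, 24],
  pointFormats: [0],
}

// Hypothetical ClientHello of a command-line client: different cipher order,
// fewer extensions.
const cliLike = {
  version: 771,
  ciphers: [4866, 4867, 4865],
  extensions: [0, 11, 10],
  curves: [29, 23, 24],
  pointFormats: [0],
}

console.log(ja3String(browserLike), ja3Hash(browserLike))
console.log(ja3String(cliLike), ja3Hash(cliLike))
```

Because the fingerprint is derived from the handshake itself, changing the `User-Agent` or other headers does not affect it; the TLS stack has to produce the same ClientHello as the browser.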

Shirshak55
  • As I said, I'm sure it is not related to cookies, because we are talking about the first request in a new browser. But about the fingerprint: do you think it can be exported, inspected, or otherwise reused for my own requests outside of Chrome? – Nabi K.A.Z. Mar 19 '22 at 14:31
  • @NabiK.A.Z. You can, but it's not that easy. See https://curl.se/libcurl/c/CURLOPT_SSL_CTX_FUNCTION.html. The better choice as of now seems to be curl-impersonate. Try it. https://github.com/lwthiker/curl-impersonate – Shirshak55 Mar 20 '22 at 02:38
  • The answer was `curl-impersonate`. I think it would be good if you added it to the answer. However, due to the complexity of these methods, I used `puppeteer` (https://github.com/puppeteer/puppeteer) for my project and got a good result. – Nabi K.A.Z. May 06 '22 at 18:18
  • The `curl-impersonate` was fine, but the problem was that it did not support websocket. In fact, curl was not well supported. https://github.com/lwthiker/curl-impersonate/issues/41 – Nabi K.A.Z. May 06 '22 at 18:43

Since the question does not specify the language in which the user wants to bypass Cloudflare protection, I will provide code for Node.js:

Libs:

npm i puppeteer-extra puppeteer-extra-plugin-stealth puppeteer

nodejs:

const puppeteer = require('puppeteer-extra')
const pluginStealth = require('puppeteer-extra-plugin-stealth')
const { executablePath } = require('puppeteer')

// Register the stealth plugin once, before any browser is launched.
puppeteer.use(pluginStealth())

const link = 'https://pegaxy.io/'

// Load the page in headless Chromium and return the rendered HTML.
const getHtmlThroughCloudflare = async (url) => {
  const browser = await puppeteer.launch({
    headless: true,
    // Use the browser binary that the puppeteer package downloaded.
    executablePath: executablePath(),
  })
  try {
    const page = await browser.newPage()
    await page.goto(url)
    return await page.content()
  } finally {
    // Close the browser even if navigation fails.
    await browser.close()
  }
}

getHtmlThroughCloudflare(link)
  .then((html) => console.log(`HTML: ${html}`))
  .catch(console.error)
Roma N