I am trying to scrape products from the MediaMarkt site with Colly. Here is my code:

func WebScraper(allowedDomain string, page string, htmlElement string, htmlTag string) {
    /*
        Order in which the Collector's callbacks are executed:
        1. OnRequest  -> Called before a request
        2. OnError    -> Called if an error occurred during the request
        3. OnResponse -> Called after the response is received
        4. OnHTML     -> Called right after OnResponse if the received content is HTML
        5. OnXML      -> Called right after OnHTML if the received content is HTML or XML
        6. OnScraped  -> Called after the OnXML callbacks
    */
    c := colly.NewCollector(
        // MaxDepth is 2, so only the links on the scraped page
        // and links on those pages are visited
        colly.AllowedDomains(allowedDomain),
        colly.MaxDepth(2),
        colly.Async(true),
    )

    // Limit the maximum parallelism to 2
    // This is necessary if the goroutines are dynamically
    // created to control the limit of simultaneous requests.
    //
    // Parallelism can also be controlled by spawning a fixed
    // number of goroutines.
    c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})

    // Step 1. Run some logic before the request is made
    c.OnRequest(func(r *colly.Request) {
        app.InfoLog.Println("Visiting ", r.URL.String())
    })

    // Step 2. If an error occurred during the request, handle it
    c.OnError(func(r *colly.Response, err error) {
        app.ErrorLog.Println("Request URL: ", r.Request.URL, " failed with response: ", r, "\nError: ", err)
    })

    // For every element matching htmlElement, log the text of its htmlTag children
    c.OnHTML(htmlElement, func(e *colly.HTMLElement) {
        app.InfoLog.Println(e.ChildText(htmlTag))
    })

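    // Visit returns an error if the request could not be queued
    if err := c.Visit(page); err != nil {
        app.ErrorLog.Println("Visit failed:", err)
    }
    // Wait until all asynchronous requests have finished
    c.Wait()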
}

Scraping Wikipedia and some other sites with this code works, but here I get a 403 Forbidden error. Here are the response headers:

Permissions-Policy : [accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()]
Expires : [Thu, 01 Jan 1970 00:00:01 GMT]
Set-Cookie : [__cf_bm=eEhiHiAsyTUuG7Ra4_rGhBWBHGxP_FWphwxEIl66hW8-1654161057-0-Aef4Vr6ypA0zr8CVP66c2x9X1s+vUcusYPkMqJR3MhpLt/FxMHi+GXMD0+YEcb2L/cLC6RVhgROG9gOvXVTjQMIYUjwyvfi1/hFvAPthwzC/; path=/; expires=Thu, 02-Jun-22 09:40:57 GMT; domain=.mediamarkt.de; HttpOnly; Secure; SameSite=None]
Vary : [Accept-Encoding]
Date : [Thu, 02 Jun 2022 09:10:57 GMT]
Expect-Ct : [max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"]
Content-Type : [text/html; charset=UTF-8]
Cache-Control : [private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0]
Server : [cloudflare]
Cf-Ray : [714f0f0e3b881c23-SOF]
X-Frame-Options : [SAMEORIGIN]
Strict-Transport-Security : [max-age=15897600]
X-We-Are-Hiring : [We appreciate developers that love to explore what goes on under the hood of software. Apply now at https://careers.mediamarktsaturn.com/MediaMarktSaturn!]

And here is the response body:

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>

<title>Please Wait... | Cloudflare</title>
  
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1" />
<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" />
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" /><![endif]-->
<style>body{margin:0;padding:0}</style>


<!--[if gte IE 10]><!-->
<script>
  if (!navigator.cookieEnabled) {
    window.addEventListener('DOMContentLoaded', function () {
      var cookieEl = document.getElementById('cookie-alert');
      cookieEl.style.display = 'block';
    })
  }
</script>
<!--<![endif]-->


    <script>
    //<![CDATA[
    (function(){
      window._cf_chl_opt={
        cvId: "2",
        cType: "managed",
        cNounce: "41590",
        cRay: "714f0f0e3b881c23",
        cHash: "7549f8b7d78a2a4",
        cUPMDTk: "\/de\/category\/smartphones-579.html?__cf_chl_tk=PrWxKIbQcP5Dh7keed1nL5yIqzx2FEiIyMvDz_3jTp0-1654161057-0-gaNycGzNBqU",
        cFPWv: "g",
        cTTimeMs: "1000",
        cLt: "n",
        cRq: {
          ru: "aHR0cHM6Ly93d3cubWVkaWFtYXJrdC5kZS9kZS9jYXRlZ29yeS9zbWFydHBob25lcy01NzkuaHRtbA==",
          ra: "Y29sbHkgLSBodHRwczovL2dpdGh1Yi5jb20vZ29jb2xseS9jb2xseQ==",
          rm: "R0VU",
          d: "vr3pEux85BB4TszTDjAPScZq2oMqIA1GoFOPEjftlymNdbnhggazvYIWsXBQOTzYsqm6B1QxUgRJqK2CNemXc9VqLj70rk1vMXKFsNRn8eSkCfbX1bVvJbp+S3YSI+zdrPmzOiiq4gO2vWm5pOKlKc+7qmux89XYc1J0YnOprUgYdHNeayUheiiXkRqwPQqW/cY1+5C2IsPzqzcU7M7YCnWjenwMn1pjLFjMclUxEi6s/gu5lLTr8HSnalidGwSVexGj4SBqmKekU99FZqEtE5kJutfFoUEiwuEJmmo7QrYuWrXRfB80Fms3xVWa8J6Ga4M9cnJgv3PP9qRucyj01EtAlfkpx7coaUfTJue65CZcHA4SJcB7WqMHdaUVojdSFsc4UoCYGbnstK2lyuX+v6GAC2GGOtK23s8DcfcB/YJsCChlpkURsIfnGbzmfI5cQf5JqWkhnW6p1UG3oKs7bec/dUNKL+XJjRH0rvyvKFkMX6Ca/0FX00zR0a1WcxnXOhU1iZzQOR2U/ZrXvfE0jeFCRQ+OHvCd0Ncfosas5axWsibMU+MeasO+bYbG8hTjHgvG8+tFc0tYII+nbVWFp44k+mWOBIhKh951P8TAoLl1h4HO9+hxKdpjQGAtjeZJ39oc3daC5julK9RJOng8Hw==",
          t: "MTY1NDE2MTA1Ni45OTkwMDA=",
          m: "cZC1J0+WAKjb0r4I8GxqyYnUTcVqCk2O4D12RYxeP7Q=",
          i1: "90OzQhzN+BROMhNBF2EFBw==",
          i2: "grkPyoRifg7B+X0FEjpHHQ==",
          zh: "q1ZR4e29hYz+cTx2o5UYJG1hFifFh0loDJNTfBOG7gU=",
          uh: "DaHp0r0NTdLobcNE2+1UVaN6g6tbXcsPQKHJoB7xdZI=",
          hh: "+dgxVyY+fQBum8yrY3Q9pqqEvjydD2WPU3jRaUrPF1o=",
        }
      };
    }());
    //]]>
    </script>

<style>
  #cf-wrapper #spinner {width:69px; margin:  auto;}
  #cf-wrapper #cf-please-wait{text-align:center}
  .attribution {margin-top: 32px;}
  .bubbles { background-color: #f58220; width:20px; height: 20px; margin:2px; border-radius:100%; display:inline-block; }
  #cf-wrapper #challenge-form { padding-top:25px; padding-bottom:25px; }
  #cf-hcaptcha-container { text-align:center;}
  #cf-hcaptcha-container iframe { display: inline-block;}
  @keyframes fader     { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
  #cf-wrapper #cf-bubbles { width:69px; }
  @-webkit-keyframes fader { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
  #cf-bubbles > .bubbles { animation: fader 1.6s infinite;}
  #cf-bubbles > .bubbles:nth-child(2) { animation-delay: .2s;}
  #cf-bubbles > .bubbles:nth-child(3) { animation-delay: .4s;}
</style>
</head>
<body>
  <div id="cf-wrapper">
    <div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>
    <div id="cf-error-details" class="cf-error-details-wrapper">
      <div class="cf-wrapper cf-header cf-error-overview">
      
        <h1 data-translate="managed_challenge_headline">Please wait...</h1>
        <h2 class="cf-subheadline"><span data-translate="managed_checking_msg">We are checking your browser...</span> www.mediamarkt.de</h2>
      
      </div>
      
      <div class="cf-section cf-highlight cf-captcha-container">
        <div class="cf-wrapper">
          <div class="cf-columns two">
            <div class="cf-column">
            
              <div class="cf-highlight-inverse cf-form-stacked">
                <form class="challenge-form managed-form" id="challenge-form" action="/de/category/smartphones-579.html?__cf_chl_f_tk=PrWxKIbQcP5Dh7keed1nL5yIqzx2FEiIyMvDz_3jTp0-1654161057-0-gaNycGzNBqU" method="POST" enctype="application/x-www-form-urlencoded">
    <div id='cf-please-wait'>
      <div id='spinner'>
        <div id="cf-bubbles">
            <div class="bubbles"></div>
            <div class="bubbles"></div>
            <div class="bubbles"></div>
        </div>
      </div>
      <p data-translate="please_wait" id="cf-spinner-please-wait">Please stand by, while we are checking your browser...</p>
      <p data-translate="redirecting" id="cf-spinner-redirecting" style="display:none">Redirecting...</p>
      </div>
  <input type="hidden" name="md" value="u0AdAefiQaOd5cct_8y26o7DHt3en_YcDPYT5F0ABUY-1654161057-0-ATANjzlyezjgr7F1BeHeI_j_uUY38_a79__nKHeOV0Dk2cJOfgMdCTl3WoYsPTD7L25TEyF0Zu27FsSj21OI2aeiNSKmAbPirtvQwqJkPR_knETzvfp75Sv1rnhXV_52btLnozXuVO3Y_z7ElYk1CZDJDEdTw8Eu-MLyEaxyZGJHxx9Tk58hP1NPpWzN98aAcbhY0L1Au8IvJiH8bVmaRlLhK2KDOcXgM7KFONTOuo5-vGZjUjtE4YbUadBFGqk8jIZTRrIXZmwIZNm7TiPlPBwAz8POM7Rw_uoL7THpV4QUctlXigEqRHrY4g-jLcJEW-uZZm2qVMpzbAFOQjJ6UvkY_RC25ZQ5L0MQr1Nnh32-OQZctZIhj8edoK1TZasOXT6u0bT5lOecpx2j82H8mF59qM_zfUbIs4H6wvEx0prqNpEu-4Z7_x1y_agGnVMtW-2OCpKPjcmn9j1-NZnZYdJbrqTzdn2j6qe-wnn3RuRSna8DnN-W7AQTCS4vn7uYc76FWBFERMIwczuHUk-KrOof_TpwA324htdvh4I7URUK8CxdSCZqdG7UfsKbjgdLStciaw_PGDud2rPsE2hQEClxPXFsbcWju8aM6BDmlxQFJm7KJHZcbJTtA8yPMfgha4EvOTTGrEwaBy16B4U18Tmo9JXUlBUJwzbtBXMxfZ0XVQWu709nvxwpWAMZb8kEPND5aXQi2jEiGZZnM3wx_JlXtxPlBiTsxP5mEJ5pf8a71v1aZAzWUcPAaHtRymR8a92yWS4Z57h4a2HSchUf8LlFiuoogFCLBNEi2IoYTuIWFhww1k1UEhjuUZ2h21G4149DN5k-xfRY53H4EyHRs30oYiABowol3n3te3kZcPwB" />
  <input type="hidden" name="r" value="919uegZGZgLhriycM0_XCKz1utWQOqsLyAsDF2mLcEY-1654161057-0-AXs4pyKaoppndTN8hJ/khCNIxpye10VI2waeNLb4xYndXBU8rLwkuUXxzAWPTOMPsGwR0KAe5aERtjPvehE5pESDCLcHgGq/H6RUBimtjqQMbxRS8fCyoLrV89WrqAv7Okw3Y+i048El6jKYonunXSU7zzKNR/EL8DIe8/qP47CVRqyOxIDJ2pVHq4GwnfXBtiiWpr4z49jikhah7wbqwOALXPYP4WYlFPrk1kZ1+VgBhEf3RtsybLxsR3E8UagLgTf4K+yNUAt+Uzmi+1qvE2oTq8cVRWZ+gBiXsmRKkWnn3hg6qg9h0DPF8X0U+h8ufqBiTIT3/Lb2M8f1/bB1Sjr6ZBo08ZO5lkGvqdx08L6TRwv5MT4yDWrubtXpZL4Dkpw0yuvLJjonxLMdoF7laSt+xW0VP7ZmAPCNBfY89CXhTqnj/78w0GiLvIFjb9kiNk7cnofy1erkGrI2e/rO6HomogGJT0kGb7V5t6HBOU8mW+4JraBqv1rYLpqv7XmPh4cqjr9DJ8iDDGcqxMciL9VWT4g0nTNlipr0JoVv7L1F36+0Yc+5FuIJwvhvIXN64LlK2vyroKNE/wu3r5O9RWVgAToNI2KlZAbJaHFCBBAhDRdDi7EaVZVoNhmA3Ju+YiNXmGJ5L21MWLwX+N9jQP1KRibF3ixAzObVKTlGmAWUQLdfrc98pHn8oDI1cpCWzrhrsdAQImLLMEO49lJQnmvWpF+lP8iULAiJG4pdsZ5dIelChc7f4W51l0bAUvL/2l/lJg7/qLxFd5PqJp8Jo7nzqbgibEvM8/55/A3wtT9WX0kJp2Da8Kez0UzrgKeAb3VdGVrHwr+k1eJ4o3fI/RBesr/aWkbgjk4EM8itKypPg/c1Ejd9h/Kn89EpeJPtgz7t+vxDyH47kzmR0L9+gWOd5UBvVel/KzwxAxpuO0fw/tNYbEO0vJ2A3NWThWuS2g34K60w+y+Tp/TrNw/yrQH6wVUUsYESQCc2ZLkt8aVRPR30GuKuC9Zjaj8C8g3ywF5EDvFPYm9ZSPjayGyW3magUchBTngl9HJTiAADmSJB8sJfFWWNVJzKP8e7QRYdGbZzy+EiKzEUN61jWlCKlhFKFIwZlCZBIQ+TYL4+ukePHWoUgttIef21cFjy/ydCoznkJDPtceQDPNyCJZHBv2ljXGJ/IpPZ3CcLW9mAVOdjorEitBUY5ObbZTnpgFelrEKo9SVuE4tSawF7ba0TBcUR7yQXKcB6xmrsdlpn0Bp2Ki7rm8XnIGcK34U2+SQ2FrVaBEHTWW3vFWcdyfQmPPoD8BQo/to3Vt3Lz3K2RC8Ugh6bDzzD61z+6d1iWJ2qIyostZIvVQoPwNqdhYrWw9eBF4DF4COCxIoA16S9TLaEqSV+5e+fBfoRVw+jmsi0qRWkYbtBI0imU7f99EEIdP4y6sz+3LeHLUufXvHHWZoT2URjpCZSXJfhnYYg77qSZbIDX5z0RcnBpGBjiISfAwpfUpwp1SPe5fqB0rka6hvGektNSI+YgSPsI8mfH4CNh2dnaxN0OJzj64zaEWKJYrG3Jzhmip7RBJ7v7utJqqLQu6EWIfJ2b8vV314ucEgB9ORIjARY0Zb/Lx7/Jzrt4wvlsuEhySPHb7TylWO1Gyra">
  <input type="hidden" name="vc" value="22dd9a5e4ec44559e78aa0e010d110ca">
  <noscript id="cf-captcha-bookmark" class="cf-captcha-info">
  <h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>
  </noscript>
    <div id="no-cookie-warning" class="cookie-warning" data-translate="turn_on_cookies" style="display:none">
      <p data-translate="turn_on_cookies" style="color:#bd2426;">Please enable Cookies and reload the page.</p>
    </div>
  <script>
  //<![CDATA[
    var a = function() {try{return !!window.addEventListener} catch(e) {return !1} },
      b = function(b, c) {a() ? document.addEventListener("DOMContentLoaded", b, c) : document.attachEvent("onreadystatechange", b)};
      b(function(){
        var cookiesEnabled=(navigator.cookieEnabled)? true : false;
        if(!cookiesEnabled){
          var q = document.getElementById('no-cookie-warning');q.style.display = 'block';
        }
      });
  //]]>
  </script>
  <div id="trk_captcha_js" style="background-image:url('/cdn-cgi/images/trace/captcha/nojs/h/transparent.gif?ray=714f0f0e3b881c23')"></div>
</form>
  <script>
    //<![CDATA[
    (function(){
        var isIE = /(MSIE|Trident\/|Edge\/)/i.test(window.navigator.userAgent);
        var trkjs = isIE ? new Image() : document.createElement('img');
        trkjs.setAttribute("src", "/cdn-cgi/images/trace/managed/js/transparent.gif?ray=714f0f0e3b881c23");
        trkjs.id = "trk_managed_js";
        trkjs.setAttribute("alt", "");
        document.body.appendChild(trkjs);
        var cpo=document.createElement('script');
        cpo.type='text/javascript';
        cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/managed/v1?ray=714f0f0e3b881c23";
        
        window._cf_chl_opt.cOgUHash = location.hash === '' && location.href.indexOf('#') !== -1 ? '#' : location.hash;
        window._cf_chl_opt.cOgUQuery = location.search === '' && location.href.slice(0, -window._cf_chl_opt.cOgUHash.length).indexOf('?') !== -1 ? '?' : location.search;
        if (window._cf_chl_opt.cUPMDTk && window.history && window.history.replaceState) {
          var ogU = location.pathname + window._cf_chl_opt.cOgUQuery + window._cf_chl_opt.cOgUHash;
          history.replaceState(null, null, "\/de\/category\/smartphones-579.html?__cf_chl_rt_tk=PrWxKIbQcP5Dh7keed1nL5yIqzx2FEiIyMvDz_3jTp0-1654161057-0-gaNycGzNBqU" + window._cf_chl_opt.cOgUHash);
          cpo.onload = function() {
            history.replaceState(null, null, ogU);
          };
        }
        
        document.getElementsByTagName('head')[0].appendChild(cpo);
    }());
    //]]>
    </script>


              </div>
            </div>

            <div class="cf-column">
              <div class="cf-screenshot-container">
              
                <span class="cf-no-screenshot"></span>
              
              </div>
            </div>
          </div>
        </div>
      </div>

      <div class="cf-section cf-wrapper">
        <div class="cf-columns two">
          <div class="cf-column">
            <h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2>
            
            <p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p>
          </div>

          <div class="cf-column">
            <h2 data-translate="resolve_captcha_headline">What can I do to prevent this in the future?</h2>
            

            <p data-translate="resolve_captcha_antivirus">If you are on a personal connection, like at home, you can run an anti-virus scan on your device to make sure it is not infected with malware.</p>

            <p data-translate="resolve_captcha_network">If you are at an office or shared network, you can ask the network administrator to run a scan across the network looking for misconfigured or infected devices.</p>
            
              
            
          </div>
        </div>
      </div>
      

      <div class="cf-error-footer cf-wrapper w-240 lg:w-full py-10 sm:py-4 sm:px-8 mx-auto text-center sm:text-left border-solid border-0 border-t border-gray-300">
  <p class="text-13">
    <span class="cf-footer-item sm:block sm:mb-1">Cloudflare Ray ID: <strong class="font-semibold">714f0f0e3b881c23</strong></span>
    <span class="cf-footer-separator sm:hidden">&bull;</span>
    <span class="cf-footer-item sm:block sm:mb-1"><span>Your IP</span>: 178.221.155.142</span>
    <span class="cf-footer-separator sm:hidden">&bull;</span>
    <span class="cf-footer-item sm:block sm:mb-1"><span>Performance &amp; security by</span> <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing" id="brand_link" target="_blank">Cloudflare</a></span>
    
  </p>
</div><!-- /.error-footer -->


    </div>
  </div>

  <script>
  window._cf_translation = {};
  
  
</script>


</body>
</html>

It looks like some sort of CAPTCHA or JavaScript challenge, but I cannot figure out how to get past it. Any advice?
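
Some of the values in the `cRq` object of the response body are plain Base64, so they can be decoded to see what the challenge recorded about the request. The sketch below decodes the `ru`, `ra`, and `rm` values copied from the body above; `decodeField` is a hypothetical helper name, and the meaning of each field is my guess from the decoded text, not documented behaviour.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeField decodes one standard-Base64 value copied from the cRq object.
func decodeField(v string) (string, error) {
	b, err := base64.StdEncoding.DecodeString(v)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Values copied verbatim from the response body above.
	fields := []struct{ name, value string }{
		{"ru", "aHR0cHM6Ly93d3cubWVkaWFtYXJrdC5kZS9kZS9jYXRlZ29yeS9zbWFydHBob25lcy01NzkuaHRtbA=="},
		{"ra", "Y29sbHkgLSBodHRwczovL2dpdGh1Yi5jb20vZ29jb2xseS9jb2xseQ=="},
		{"rm", "R0VU"},
	}
	for _, f := range fields {
		s, err := decodeField(f.value)
		if err != nil {
			fmt.Printf("%s: decode error: %v\n", f.name, err)
			continue
		}
		fmt.Printf("%s = %s\n", f.name, s)
	}
}
```

Decoded, `ru` is the requested URL, `rm` is `GET`, and `ra` is `colly - https://github.com/gocolly/colly`, which looks like Colly's default User-Agent string. So the request identified itself as a scraper when it reached Cloudflare.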

Stefan Radonjic
  • If you scroll down, the response makes the situation pretty clear: "Completing the CAPTCHA proves you are a human". The owner of the website doesn't want you to scrape their content. Maybe they have an API they want you to use instead, or maybe they just don't want you using their data that way. – IMSoP Jun 02 '22 at 09:20
  • I wanted to scrape products for ML dataset, but yeah, seems I can't do it. – Stefan Radonjic Jun 02 '22 at 09:21
  • Contact the website owner if you think they might be OK with it. But asking "how do I avoid this CAPTCHA to do exactly the thing it's there to protect against" is similar to asking "how do I get past this locked door because I really want to see what's on the other side". – IMSoP Jun 02 '22 at 09:27
