-1

I am scraping a web page of tech components to collect results to compare later. For this task, I am using Scrapy and Python. After two months of scraping the site, I started getting a 403 status error. I have tried to change:

  1. The bot name
  2. The User-Agent, trying several different agents
  3. Launching the scraper from my friend's computer
  4. Launching the scraper from different IPs
  5. 3 and 4 together

These five steps make me think they have information about my scraper rather than about my computer, and that they have blocked my bot. This is not the first time it has happened: they blocked my bot a month ago and unblocked the same bot a week later.

I am looking for fresh ideas, because everybody on forums and scraping sites just recommends changing user-agents.

I have tried to make a simple request with this code:

import requests
 
url = 'https://www.webwithcloudflareprotection.com/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
    }

r = requests.get(url, headers=headers)
print(r.status_code)

This code always gets a 403, from every IP I launch it from. It's very strange. Someone told me about Cloudflare, but I don't know how to check whether this software is behind all this.
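One way to check for Cloudflare, as a sketch: Cloudflare normally adds a `CF-RAY` response header and reports `cloudflare` in the `Server` header, so you can inspect the headers of the 403 response. The URL is the placeholder from the question, and `looks_like_cloudflare` is my own helper name, not a library function:

```python
def looks_like_cloudflare(headers):
    """Heuristic: Cloudflare usually sets a CF-RAY response header
    and reports 'cloudflare' in the Server header."""
    keys = {k.lower() for k in headers}
    server = headers.get("Server", headers.get("server", "")).lower()
    return "cf-ray" in keys or "cloudflare" in server

# Usage against the placeholder URL (requires `pip install requests`):
# import requests
# r = requests.get("https://www.webwithcloudflareprotection.com/")
# print(r.status_code, looks_like_cloudflare(r.headers))
```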

Leon Lopez
  • 15
  • 1
  • 7
  • Try using proxies and VPN's. https://stackoverflow.com/questions/4710483/scrapy-and-proxies – patrickgerard Jun 08 '21 at 19:48
  • [403 Forbidden](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403) is administrative. Either the URL is unauthorized, or your IP/user-agent is. Either way, asking how to evade a 403 is outside the scope of a reasonable SO question. – Todd A. Jacobs Jun 08 '21 at 19:56

3 Answers

1

Try going to a browser and making the same request your bot does. If the request isn't rejected, open the developer tools and copy the User-Agent header from your browser.

Also, here's something similar to your problem: HTTP error 403 in Python 3 Web Scraping
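For illustration, a full browser-like header set might look like the sketch below. The values are examples copied from a Firefox session, not guaranteed to match your browser, and anti-bot systems often check more than just the User-Agent, so copy the whole set DevTools shows:

```python
def browser_headers():
    """Example header set as captured from a real Firefox session in
    DevTools -> Network. Replace values with what your browser sends."""
    return {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }

# Usage (requires `pip install requests`):
# import requests
# r = requests.get("https://www.webwithcloudflareprotection.com/",
#                  headers=browser_headers())
# print(r.status_code)
```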

blazej
  • 927
  • 4
  • 11
  • 21
0

Finally, the problem was third-party software sitting between my machine and their server. I found a way around it by integrating Scrapy with Selenium and chromedriver.

It may not be the best solution, but it works. Performance is slower, but the results are the same!
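A sketch of that integration, with hypothetical names of my own (the `scrapy-selenium` package on PyPI implements the same idea): a Scrapy downloader middleware hands each request to a headless Chrome and wraps the rendered page in an `HtmlResponse`. The imports are done lazily so the project still loads without Selenium installed:

```python
class SeleniumDownloaderMiddleware:
    """Fetch pages with a real browser so JS-based bot checks can pass.
    Hypothetical name; enable it in settings.py, e.g.:
    DOWNLOADER_MIDDLEWARES = {
        "myproject.middlewares.SeleniumDownloaderMiddleware": 543,
    }"""

    def __init__(self):
        self.driver = None  # created lazily on the first request

    def _get_driver(self):
        # Lazy imports: selenium/chromedriver are only needed at crawl time.
        from selenium import webdriver
        from selenium.webdriver.chrome.options import Options
        if self.driver is None:
            opts = Options()
            opts.add_argument("--headless")
            self.driver = webdriver.Chrome(options=opts)
        return self.driver

    def process_request(self, request, spider):
        from scrapy.http import HtmlResponse
        driver = self._get_driver()
        driver.get(request.url)
        # Returning a response here short-circuits Scrapy's own download.
        return HtmlResponse(
            url=driver.current_url,
            body=driver.page_source,
            encoding="utf-8",
            request=request,
        )
```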

Leon Lopez
  • 15
  • 1
  • 7
-2

I solved this issue by integrating Selenium with Scrapy. The problem is the Cloudflare protection, so VPNs, proxies or user-agents don't solve anything.

The solution is to imitate a browser using Selenium and extract the info from the HTML it returns.
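A minimal sketch of that approach, assuming Selenium and chromedriver are installed. The URL is the placeholder from the question, and `page_title` is just a standard-library example of pulling something out of the rendered HTML; in practice you would extract the component data instead:

```python
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Collect the text inside the <title> tag of raw HTML."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def page_title(html):
    parser = TitleGrabber()
    parser.feed(html)
    return parser.title.strip()

# Usage sketch (requires `pip install selenium` + chromedriver):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("https://www.webwithcloudflareprotection.com/")
# print(page_title(driver.page_source))
# driver.quit()
```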

Leon Lopez
  • 15
  • 1
  • 7