-1

I am scraping a web page of tech components to collect results to compare later. For this task, I am using Scrapy and Python. After two months of scraping the site, I started getting a 403 status error. I have tried to change:

  1. The bot name
  2. The User-Agent, trying several different agents
  3. Launching the scraper from my friend's computer
  4. Launching the scraper from different IPs
  5. 3 and 4 together

These five steps make me think they have information about my scraper rather than about my computer, and that they have blocked my bot. This is not the first time it has happened: they blocked my bot a month ago and unblocked the same bot a week later.

I am looking for fresh ideas, because everybody on forums and scraping sites just recommends changing user-agents.

I have tried to make a simple request with this code:

import requests
 
url = 'https://www.webwithcloudflareprotection.com/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
    }

r = requests.get(url, headers=headers)
print(r.status_code)

This code always gets a 403, from every IP I launch it from. It's very strange. Someone told me about Cloudflare, but I don't know how to check whether this software is behind all this.
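One way to check for Cloudflare, as a sketch: Cloudflare normally adds a `CF-RAY` response header and reports `cloudflare` in the `Server` header, so you can inspect the headers of the 403 response. The URL is the placeholder from the question, and `looks_like_cloudflare` is my own helper name, not a library function:

```python
def looks_like_cloudflare(headers):
    """Heuristic: Cloudflare usually sets a CF-RAY response header
    and reports 'cloudflare' in the Server header."""
    keys = {k.lower() for k in headers}
    server = headers.get("Server", headers.get("server", "")).lower()
    return "cf-ray" in keys or "cloudflare" in server

# Usage against the placeholder URL (requires `pip install requests`):
# import requests
# r = requests.get("https://www.webwithcloudflareprotection.com/")
# print(r.status_code, looks_like_cloudflare(r.headers))
```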

Leon Lopez
  • 15
  • 1
  • 7
  • Try using proxies and VPN's. https://stackoverflow.com/questions/4710483/scrapy-and-proxies – patrickgerard Jun 08 '21 at 19:48
  • [403 Forbidden](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403) is administrative. Either the URL is unauthorized, or your IP/user-agent is. Either way, asking how to evade a 403 is outside the scope of a reasonable SO question. – Todd A. Jacobs Jun 08 '21 at 19:56

3 Answers

1

Try going to a browser and making the same request your bot does. If the request isn't rejected, open the developer tools and copy the User-Agent header from your browser.

Also, here's something similar to your problem: HTTP error 403 in Python 3 Web Scraping
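For illustration, a full browser-like header set might look like the sketch below. The values are examples copied from a Firefox session, not guaranteed to match your browser, and anti-bot systems often check more than just the User-Agent, so copy the whole set DevTools shows:

```python
def browser_headers():
    """Example header set as captured from a real Firefox session in
    DevTools -> Network. Replace values with what your browser sends."""
    return {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }

# Usage (requires `pip install requests`):
# import requests
# r = requests.get("https://www.webwithcloudflareprotection.com/",
#                  headers=browser_headers())
# print(r.status_code)
```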

blazej
  • 927
  • 4
  • 11
  • 21
0

Finally, the problem was third-party software sitting between my machine and their server. I found a way around it by integrating Scrapy with Selenium and chromedriver.

It may not be the best solution, but it works. Performance is slower, but the results are the same!
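A sketch of that integration, with hypothetical names of my own (the `scrapy-selenium` package on PyPI implements the same idea): a Scrapy downloader middleware hands each request to a headless Chrome and wraps the rendered page in an `HtmlResponse`. The imports are done lazily so the project still loads without Selenium installed:

```python
class SeleniumDownloaderMiddleware:
    """Fetch pages with a real browser so JS-based bot checks can pass.
    Hypothetical name; enable it in settings.py, e.g.:
    DOWNLOADER_MIDDLEWARES = {
        "myproject.middlewares.SeleniumDownloaderMiddleware": 543,
    }"""

    def __init__(self):
        self.driver = None  # created lazily on the first request

    def _get_driver(self):
        # Lazy imports: selenium/chromedriver are only needed at crawl time.
        from selenium import webdriver
        from selenium.webdriver.chrome.options import Options
        if self.driver is None:
            opts = Options()
            opts.add_argument("--headless")
            self.driver = webdriver.Chrome(options=opts)
        return self.driver

    def process_request(self, request, spider):
        from scrapy.http import HtmlResponse
        driver = self._get_driver()
        driver.get(request.url)
        # Returning a response here short-circuits Scrapy's own download.
        return HtmlResponse(
            url=driver.current_url,
            body=driver.page_source,
            encoding="utf-8",
            request=request,
        )
```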

Leon Lopez
  • 15
  • 1
  • 7
-2

I solved this issue by integrating Selenium with Scrapy. The problem is the Cloudflare protection, so VPNs, proxies or user-agents don't solve anything.

The solution is to imitate a browser using Selenium and extract the info from the HTML it returns.
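A minimal sketch of that approach, assuming Selenium and chromedriver are installed. The URL is the placeholder from the question, and `page_title` is just a standard-library example of pulling something out of the rendered HTML; in practice you would extract the component data instead:

```python
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Collect the text inside the <title> tag of raw HTML."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def page_title(html):
    parser = TitleGrabber()
    parser.feed(html)
    return parser.title.strip()

# Usage sketch (requires `pip install selenium` + chromedriver):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("https://www.webwithcloudflareprotection.com/")
# print(page_title(driver.page_source))
# driver.quit()
```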

Leon Lopez
  • 15
  • 1
  • 7