Scrapy: Redirecting to a confirmation page with a captcha

Question

How can I stop Scrapy from being redirected from a target URL to another URL, which is the website's confirmation page with a captcha?

Here is my code below:

yield scrapy.Request(meta={'handle_httpstatus_list': [302], 'dont_redirect': True, 'redirect_enabled':False},url=url, callback=self.profileCategoryPages)

Now it redirects me from one web page to another, and I don't know why. It did not happen the first time I ran the spider, but on the second run and every run after that, all I get is the redirect to the other page.

Target page: http://www.profilecanada.com/browse_by_category.cfm/

Redirected to this page: http://www.profilecanada.com/confirmReqPage.cfm

Thank you for your help!

Could you post the crawl log? You can do this via the `scrapy crawl spider --logfile output.log` or `scrapy crawl spider 2>&1 | tee output.log` commands (the latter sends output to both the screen and a file). You're probably not being redirected; rather, the website has marked you as a bot and shows you captcha-gated content because it doesn't trust you. - Granitosaurus, Jul 27 '17 at 10:20
Yes. I just figured out that I was blocked from accessing the website. Do you have any suggestions, sir? Thank you. - RF_956, Aug 15 '17 at 08:17
It's a very broad issue. First you need to figure out why you are being captcha-gated. Why do they think you're a bot? Do your requests look human? Checking the user agent header and other headers is a good place to start. Do they think you're a bot because you crawl too fast? Then you need to add some delays or get some proxies. - Granitosaurus, Aug 15 '17 at 12:53
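
One way to confirm the diagnosis from the comments is to check, inside the callback, whether the 302 response points at the confirmation page. With `dont_redirect` and `handle_httpstatus_list: [302]` in the request meta, as in the question, the callback receives the 302 response itself. The sketch below is a plain-Python helper and only an illustration: the name `is_captcha_redirect` and the constant `CONFIRM_PATH` are hypothetical, and the path is taken from the redirect target quoted in the question.

```python
from urllib.parse import urljoin, urlparse

# Path of the confirmation page, taken from the URL quoted in the question.
CONFIRM_PATH = "/confirmReqPage.cfm"

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def is_captcha_redirect(status, location, base_url):
    """Return True when a redirect response points at the confirmation page.

    `location` is the value of the Location header; Scrapy returns header
    values as bytes, so both bytes and str are accepted here.
    """
    if status not in REDIRECT_STATUSES or not location:
        return False
    if isinstance(location, bytes):
        location = location.decode("latin-1")
    # The Location header may be relative, so resolve it against the request URL.
    target = urljoin(base_url, location)
    return urlparse(target).path.lower() == CONFIRM_PATH.lower()
```

In a Scrapy callback this could be called as `is_captcha_redirect(response.status, response.headers.get("Location"), response.url)`, logging a warning when it returns True so the block is visible in the crawl log.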

score 0 · Answer 1 · answered Aug 18 '17 at 07:51

I think I was blocked because I did not set a delay between page requests. Also, I had written the spider as a stand-alone scraper program, so there was no settings.py available to modify. What I did is this:

Create the scraper as a project by running:

scrapy startproject
Add my previously created scraper program to the spiders folder inside the newly created project.
Modify settings.py:

DOWNLOAD_DELAY = , CONCURRENT_REQUESTS = 20, CONCURRENT_REQUESTS_PER_DOMAIN = 1, DOWNLOAD_TIMEOUT = 30

Now it works!
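
For a stand-alone spider, the same throttling can probably be applied without creating a project. The sketch below collects the settings in a dict. The `DOWNLOAD_DELAY` value of 2 seconds is purely an assumption, since the value used above is not shown; the other three values are the ones listed above. The name `THROTTLE_SETTINGS` is hypothetical.

```python
# Throttling settings for a stand-alone spider (no settings.py required).
THROTTLE_SETTINGS = {
    # ASSUMPTION: illustrative value only; the value actually used is not given.
    "DOWNLOAD_DELAY": 2,                  # seconds to wait between requests
    "CONCURRENT_REQUESTS": 20,            # value from the answer
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # value from the answer
    "DOWNLOAD_TIMEOUT": 30,               # seconds; value from the answer
}

# Inside a spider class, the dict can be attached as a class attribute:
#     custom_settings = THROTTLE_SETTINGS
# or, when the crawl is started from a script, passed to the process:
#     CrawlerProcess(settings=THROTTLE_SETTINGS)
```

With `CONCURRENT_REQUESTS_PER_DOMAIN` set to 1, requests to a single site are sent one at a time, so the delay is what mainly controls the crawl rate.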
