
How can I stop Scrapy from being redirected from a target URL to another URL, which is a captcha confirmation page on the website?

Here is my code below:

    yield scrapy.Request(url=url, meta={'handle_httpstatus_list': [302], 'dont_redirect': True, 'redirect_enabled': False}, callback=self.profileCategoryPages)

Now the request is redirected from the target page to another page, and I don't know why. It did not happen the first time I ran the spider, but on the second run and every run after that, all I get is the redirect to the other page.

Target page: http://www.profilecanada.com/browse_by_category.cfm/

Redirected to this page: http://www.profilecanada.com/confirmReqPage.cfm

Thank you for your help!

RF_956
  • Could you post the crawl log? You can do this via the `scrapy crawl spider --logfile output.log` or `scrapy crawl spider 2>&1 | tee output.log` commands (the latter sends output to both the screen and the file). You're probably not being redirected; rather, the website has marked you as a bot and shows you captcha-gated content because it doesn't trust you. – Granitosaurus Jul 27 '17 at 10:20
  • Yes. I just figured out that I was blocked from accessing the website. Do you have any suggestions, sir? Thank you. – RF_956 Aug 15 '17 at 08:17
  • It's a very broad issue. First you need to figure out why you are being captcha-gated. Why do they think you're a bot? Do your requests look human? Checking the user agent header and other headers is a good place to start. Do they think you're a bot because you crawl too fast? Then you need to add some delays or get some proxies. – Granitosaurus Aug 15 '17 at 12:53
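
One way to confirm this diagnosis from inside the spider is to check whether a 3xx response points at the confirmation page named in the question. The following is only a minimal sketch: the helper name `is_captcha_redirect` and the constant `CAPTCHA_PATH` are hypothetical, and it assumes the site always sends blocked clients to `/confirmReqPage.cfm`.

```python
from urllib.parse import urljoin, urlparse

# Path of the confirmation page reported in the question (assumed to be stable).
CAPTCHA_PATH = "/confirmReqPage.cfm"

# HTTP status codes that carry a Location header for redirects.
REDIRECT_CODES = (301, 302, 303, 307, 308)


def is_captcha_redirect(status, location, base_url):
    """Return True if a redirect response points at the captcha page."""
    if status not in REDIRECT_CODES or not location:
        return False
    # The Location header may be relative, so resolve it against the request URL.
    target = urljoin(base_url, location)
    return urlparse(target).path.lower() == CAPTCHA_PATH.lower()
```

With `handle_httpstatus_list` set to `[302]` and `dont_redirect` set to `True`, as in the question, the 302 response should reach the callback, where `response.status`, the decoded `Location` header, and `response.url` could be passed to this helper to log the block instead of parsing an empty page.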

1 Answer


I think the reason I was blocked is that I didn't set a delay between requests to the website. Also, I had created the spider as a standalone scraper program, so there was no settings.py available to modify. What I did is this:

  1. Create a Scrapy project by running:

    scrapy startproject <project_name>

  2. Add my previously created scraper to the spiders folder inside the newly created project

  3. Modify settings.py:

    DOWNLOAD_DELAY =
    CONCURRENT_REQUESTS = 20
    CONCURRENT_REQUESTS_PER_DOMAIN = 1
    DOWNLOAD_TIMEOUT = 30

Now it works!
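
For a standalone spider with no settings.py, the same throttling can probably be applied without creating a project, because Scrapy reads a `custom_settings` dictionary defined on the spider class. Below is a minimal sketch of such a dictionary. The name `THROTTLE_SETTINGS` is hypothetical, and the `DOWNLOAD_DELAY` value of 2 seconds is an assumption, since the value actually used is not given above.

```python
# Throttling values mirroring the settings.py changes described above.
# DOWNLOAD_DELAY is an assumed value; the delay actually used is not stated.
THROTTLE_SETTINGS = {
    "DOWNLOAD_DELAY": 2,  # seconds to wait between requests (assumption)
    "CONCURRENT_REQUESTS": 20,  # global cap on parallel requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per site
    "DOWNLOAD_TIMEOUT": 30,  # seconds before a download is abandoned
}
```

Assigning this dictionary to the spider's class attribute, as in `custom_settings = THROTTLE_SETTINGS`, should make these values override the defaults for that spider only.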

RF_956