Scrapy: how to stop a redirect (302)

Question

I'm trying to crawl a URL using Scrapy, but it redirects me to a page that doesn't exist.

Redirecting (302) to <GET http://www.shop.inonit.in/mobile/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/1275197> from <GET http://www.shop.inonit.in/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/pid-1275197.aspx>

The problem is that http://www.shop.inonit.in/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/pid-1275197.aspx exists, but http://www.shop.inonit.in/mobile/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/1275197 doesn't, so the crawler can't find it. I've crawled many other websites and haven't had this problem anywhere else. Is there a way I can stop this redirect?

Any help would be much appreciated. Thanks.

Update: This is my spider class

class Inon_Spider(BaseSpider):
    name = 'Inon'
    allowed_domains = ['www.shop.inonit.in']

    start_urls = ['http://www.shop.inonit.in/Products/Inonit-Gadget-Accessories-Mobile-Covers/-The-Red-Tag/Samsung-Note-2-Dead-Mau/pid-2656465.aspx']

    def parse(self, response):

        item = DealspiderItem()
        hxs = HtmlXPathSelector(response)

        title = hxs.select('//div[@class="aboutproduct"]/div[@class="container9"]/div[@class="ctl_aboutbrand"]/h1/text()').extract()
        price = hxs.select('//span[@id="ctl00_ContentPlaceHolder1_Price_ctl00_spnWebPrice"]/span[@class="offer"]/span[@id="ctl00_ContentPlaceHolder1_Price_ctl00_lblOfferPrice"]/text()').extract()
        prc = price[0].replace("Rs.  ","")
        description = []

        item['price'] = prc
        item['title'] = title
        item['description'] = description
        item['url'] = response.url

        return item

akhter wahab · Answer 1 · 2021-01-25T10:37:17.797

22

Yes, you can do this by adding meta values to the request, for example:

meta={'dont_redirect': True}

You can also have specific response codes passed to your callback instead of being redirected, for example:

meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}

This stops the redirect for 302 responses and hands them to your callback. You can list as many HTTP status codes as you need.

Example:

yield Request('some url',
    meta={
        'dont_redirect': True,
        'handle_httpstatus_list': [302]
    },
    callback=self.some_call_back)

edited Jan 25 '21 at 10:37

answered Mar 18 '13 at 13:20

akhter wahab


1

Thanks for the response! But I'm a bit confused as to where to put this line of code. I tried to override start_requests, but it gives me the error "'Response' object has no attribute 'body_as_unicode'". Can we return an item and a request at the same time? – user_2000 Mar 18 '13 at 15:10
You can call hxs = HtmlXPathSelector(response), but for a redirect you would have to test response.status == 302 and handle that case differently. The hxs will fail in that case because response.body is empty for a 302 status. – Frederic Bazin Jan 01 '15 at 15:00
Has anybody tested this? It is not working with the current Scrapy version. I tested with `'handle_httpstatus_list': [404, 301]` and only 404 works. – Jul 02 '15 at 20:07
It stops redirecting, but it also doesn't crawl content from the given pages. Any solutions? – Demonedge Oct 20 '15 at 06:45
You can put this code in the start_requests method of the spider class. When the spider runs, it goes through its init method first and then calls start_requests; at that point the request hasn't been sent yet. You can put this there: `request = Request(url=self.start_urls[0], callback=self.parse); request.meta['dont_redirect'] = True; return [request]`. The request is then sent and, if it succeeds, the response goes to the parse method or whichever callback you set. – Pentux May 24 '16 at 10:22

score 12 · Answer 2 · edited Jan 09 '21 at 12:44


After looking at the documentation and looking through the relevant source, I was able to figure it out. If you look in the source for start_requests, you'll see that it calls make_requests_from_url for all URLs.

Instead of modifying start_requests, I modified make_requests_from_url

def make_requests_from_url(self, url):
    return Request(url, dont_filter=True, meta = {
        'dont_redirect': True,
        'handle_httpstatus_list': [301, 302]
    })

I added this method to my spider, right above parse().

edited Jan 09 '21 at 12:44

Evhz


answered Jan 14 '15 at 18:28

Chad Casey


I tried this but I still get redirected to the page I don't want – Demonedge Oct 20 '15 at 06:31
1

Lovely solution for the site I want to work with. Thanks! – zsljulius Aug 22 '17 at 02:48

score 8 · Answer 3 · edited May 30 '13 at 21:54


By default, Scrapy uses RedirectMiddleware to handle redirection. You can set REDIRECT_ENABLED to False to disable redirection.

See documentation.
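A minimal sketch of that setting, assuming a standard Scrapy project with a settings.py file. REDIRECT_ENABLED is the setting named above. HTTPERROR_ALLOWED_CODES is an addition that this answer does not mention; it is included on the assumption that you want the 301/302 responses delivered to your callbacks rather than dropped as non-2xx responses.

```python
# settings.py (applies to every spider in the project)

# Turn off the redirect middleware so 3xx responses are not followed.
REDIRECT_ENABLED = False

# Assumption: you want to see the redirect responses in your callbacks.
# Without this, non-2xx responses are filtered out before they reach them.
HTTPERROR_ALLOWED_CODES = [301, 302]
```

Because this is project-wide, it also disables redirects you may want elsewhere; the per-request `dont_redirect` meta key is the narrower option.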

edited May 30 '13 at 21:54

alecxe


answered Apr 23 '13 at 03:23

imwilsonxu


3

I tried this. If I set "REDIRECT_ENABLED=False", Scrapy stops redirecting, but I also can't get the HTML content. – house May 09 '14 at 10:21

Evhz · Answer 4 · 2021-01-09T12:41:08.570

3

As explained here: Scrapy docs

Use Request Meta

request = scrapy.Request(link.url, callback=self.parse2)
request.meta['dont_redirect'] = True
yield request

edited Jan 09 '21 at 12:41

answered Dec 04 '15 at 17:24

Evhz

