
I am trying to get the response status code with Scrapy and scrapy-splash. Below is the spider code.

import scrapy
from scrapy_splash import SplashRequest  # used by the commented-out first variant


class Exp10itSpider(scrapy.Spider):
    name = "exp10it"

    def start_requests(self):
        urls = [
            'http://192.168.8.240:8000/xxxx',
        ]
        for url in urls:
            # yield SplashRequest(url, self.parse, args={'wait': 0.5, 'dont_redirect': True}, meta={'handle_httpstatus_all': True})
            # yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True})
            yield scrapy.Request(
                url,
                self.parse,
                meta={
                    'handle_httpstatus_all': True,
                    'splash': {
                        'args': {
                            'html': 1,
                            'png': 1,
                        },
                    },
                },
            )

    def parse(self, response):
        # Pause so the status code can be inspected interactively.
        input("start .........")
        print("status code is:\n")
        input(response.status)

My start URL http://192.168.8.240:8000/xxxx returns a 404 status code. There are three ways of making the request in the code above:

the first is:

yield SplashRequest(url, self.parse, args={'wait': 0.5, 'dont_redirect': True},meta={'handle_httpstatus_all': True})

the second is:

yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True})

the third is:

yield scrapy.Request(
    url,
    self.parse,
    meta={
        'handle_httpstatus_all': True,
        'splash': {
            'args': {
                'html': 1,
                'png': 1,
            },
        },
    },
)

Only the second way, yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True}), gets the correct status code 404. The first and the third both get status code 200. In other words, once I use scrapy-splash, I can no longer get the correct 404 status code. Can you help me?

Gallaecio
quanyechavs huo

1 Answer


As the scrapy-splash documentation suggests, you have to pass magic_response=True to SplashRequest to achieve this:

meta['splash']['http_status_from_error_code'] - set response.status to HTTP error code when assert(splash:go(..)) fails; it requires meta['splash']['magic_response']=True. http_status_from_error_code option is False by default if you use raw meta API; SplashRequest sets it to True by default.

EDIT: I was only able to get it to work with the execute endpoint, though. Here is a sample spider that tests the HTTP status code using httpbin.org:

# -*- coding: utf-8 -*-
import scrapy
import scrapy_splash

class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'

    lua_script = """
    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(0.5))
      return {
        html = splash:html(),
        png = splash:png(),
      }
    end
    """

    def start_requests(self):
        yield scrapy_splash.SplashRequest(
            'https://httpbin.org/status/402', self.parse,
            endpoint='execute',
            magic_response=True,
            meta={'handle_httpstatus_all': True},
            args={'lua_source': self.lua_script})

    def parse(self, response):
        pass

It passes the HTTP 402 status code to Scrapy, as can be seen from the output:

...
2017-10-23 08:41:31 [scrapy.core.engine] DEBUG: Crawled (402) <GET https://httpbin.org/status/402 via http://localhost:8050/execute> (referer: None)
...

You can experiment with other HTTP status codes as well.
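
For anyone who prefers the raw meta API over SplashRequest, the same settings can be sketched as a plain meta dict. This is a minimal, untested sketch: build_splash_meta is a hypothetical helper name, and it assumes the execute endpoint is needed here as well, as in the spider above. The meta keys themselves (endpoint, args, magic_response, http_status_from_error_code) are the ones scrapy-splash documents.

```python
# Same Lua script as in the spider above: assert() fails on HTTP errors,
# which is the failure that http_status_from_error_code reacts to.
LUA_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
  }
end
"""


def build_splash_meta(lua_source):
    """Hypothetical helper: build request meta for the raw scrapy-splash API."""
    return {
        'handle_httpstatus_all': True,
        'splash': {
            'endpoint': 'execute',
            'args': {'lua_source': lua_source},
            'magic_response': True,
            # False by default with the raw meta API, so set it explicitly.
            'http_status_from_error_code': True,
        },
    }


# Inside a spider's start_requests (requires Scrapy and scrapy-splash):
#     yield scrapy.Request(url, self.parse, meta=build_splash_meta(LUA_SCRIPT))
```

With SplashRequest, http_status_from_error_code is already True by default, which is why the spider above does not set it.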

Tomáš Linhart
  • I tried the code below with http_status_from_error_code=True, but it still fails. ```yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True, 'splash': { 'args': { 'html': 1, 'png': 1, }, 'magic_response': True, 'http_status_from_error_code': True } } )``` – quanyechavs huo Oct 23 '17 at 03:24
  • Why don't you use `SplashRequest`? It's the recommended way of using Splash with Scrapy. – Tomáš Linhart Oct 23 '17 at 05:45
  • I tried SplashRequest with the code below, but it still fails. `yield SplashRequest(url, self.parse, args={'wait': 0.5, 'dont_redirect': True}, meta={'handle_httpstatus_all': True})` Am I using it incorrectly? – quanyechavs huo Oct 23 '17 at 05:49
  • Try `yield SplashRequest(url, self.parse, args={'wait': 0.5, 'dont_redirect': True}, magic_response=True, meta={'handle_httpstatus_all': True})` – Tomáš Linhart Oct 23 '17 at 05:51
  • `yield SplashRequest(url, self.parse, args={'wait': 0.5, 'dont_redirect': True}, magic_response=True, meta={'handle_httpstatus_all': True})` still fails: the returned status code is 200 again, not 404. – quanyechavs huo Oct 23 '17 at 05:57
  • With your example URL `https://httpbin.org/status/402`, the result is `response.status=402`, but `response.body` is not the correct content `Fuck you, pay me!`. Can you help me? – quanyechavs huo Oct 23 '17 at 08:12
  • You reached my limit ;-) Don't know. – Tomáš Linhart Oct 23 '17 at 08:56
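
On the open question about response.body, one possible direction (an untested sketch, not something confirmed in this thread) is to avoid assert on splash:go, so the script does not abort on an HTTP error, and to return the status explicitly. It relies on two documented behaviours: splash:go returns a reason of the form http<code> on HTTP errors, and with magic_response enabled scrapy-splash fills response.status from an http_status key and response.body from an html key. The names LUA_STATUS_SCRIPT and status_request_kwargs are hypothetical, and treating every successful load as 200 is an assumption.

```python
# Lua script that reports the HTTP status instead of failing on it.
LUA_STATUS_SCRIPT = """
function main(splash, args)
  local ok, reason = splash:go(args.url)
  local status = 200  -- assumption: a successful load is reported as 200
  if not ok then
    -- HTTP errors are reported as "http<code>", e.g. "http404"
    local code = tonumber(string.match(reason, '^http(%d+)$'))
    if code == nil then
      error(reason)  -- not an HTTP error (network error etc.)
    end
    status = code
  end
  assert(splash:wait(0.5))
  return {
    http_status = status,
    html = splash:html(),
  }
end
"""


def status_request_kwargs(lua_source=LUA_STATUS_SCRIPT):
    """Hypothetical helper: keyword arguments for scrapy_splash.SplashRequest."""
    return {
        'endpoint': 'execute',
        'magic_response': True,
        'meta': {'handle_httpstatus_all': True},
        'args': {'lua_source': lua_source},
    }


# Inside a spider's start_requests (requires Scrapy and scrapy-splash):
#     yield scrapy_splash.SplashRequest(url, self.parse, **status_request_kwargs())
```

If this behaves as expected, response.body would be the HTML that Splash rendered for the error page, which wraps the original text in markup rather than reproducing the raw bytes.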