
I'm trying to access cookies after I've made a request using Splash. Below is how I've built the request.

script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
    })
  assert(splash:wait(0.5))

  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""
req = SplashRequest(
    url,
    self.parse_page,
    endpoint='execute',  # run the Lua script; endpoint is a kwarg of SplashRequest, not an args entry
    args={
        'wait': 0.5,
        'lua_source': script,
    }
)

The script is an exact copy from the Splash documentation.

So I'm trying to access the cookies that are set on the webpage. The code below works as I expect when I'm not using Splash, but not when I am.

self.logger.debug('Cookies: %s', response.headers.get('Set-Cookie'))

While using Splash, this returns:

2017-01-03 12:12:37 [spider] DEBUG: Cookies: None

Without Splash, the same code works and returns the cookies provided by the webpage.

The Splash documentation shows this code as an example:

def parse_result(self, response):
    # here response.body contains result HTML;
    # response.headers are filled with headers from last
    # web page loaded to Splash;
    # cookies from all responses and from JavaScript are collected
    # and put into Set-Cookie response header, so that Scrapy
    # can remember them.

I'm not sure whether I'm understanding this correctly, but I read it as saying I should be able to access the cookies in the same way as when I'm not using Splash.

Middleware settings:

# Download middlewares 
DOWNLOADER_MIDDLEWARES = {
    # Use a random user agent on each request
    'crawling.middlewares.RandomUserAgentDownloaderMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    # Enable crawlera proxy
    'scrapy_crawlera.CrawleraMiddleware': 600,
    # Enable Splash to render javascript
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810, 
}

So my question is: how do I access cookies while using a Splash request?


Casper

2 Answers


You can set the SPLASH_COOKIES_DEBUG=True option (in settings.py) to see all cookies that are being set. The current cookiejar, with all cookies merged, is available as response.cookiejar when scrapy-splash is configured correctly.
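For example, a minimal sketch (it assumes the middleware setup from the scrapy-splash README, and parse_page matches the callback name used in the question):

# settings.py -- log every cookie scrapy-splash sends and receives
SPLASH_COOKIES_DEBUG = True

# inside the spider class
def parse_page(self, response):
    # SplashCookiesMiddleware exposes the merged cookiejar here;
    # iterating it should yield standard Cookie objects
    for cookie in response.cookiejar:
        self.logger.debug('Cookie: %s=%s', cookie.name, cookie.value)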

Using response.headers.get('Set-Cookie') is not robust because in case of redirects (e.g. JS redirects) there could be several responses, and a cookie could be set in the first one while the script returns headers only for the last response.

I'm not sure whether this is the problem you're having, though; the code is not an exact copy from the Splash docs. Here:

req = SplashRequest(
    url,
    self.parse_page,
    args={
        'wait': 0.5,
        'lua_source': script
    }
) 

you're sending the request to the default /render.json endpoint, which doesn't execute Lua scripts; use endpoint='execute' to fix that.
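For instance, a sketch of the corrected call (note that endpoint is a keyword argument of SplashRequest itself, not an entry in args):

req = SplashRequest(
    url,
    self.parse_page,
    endpoint='execute',  # run lua_source via Splash's /execute endpoint
    args={
        'wait': 0.5,
        'lua_source': script,
    }
)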

Mikhail Korobov
  • I've added the endpoint to the request but with no result. response.headers.get('Set-Cookie') still returns None. For response.cookiejar I get an error: AttributeError: 'SplashTextResponse' object has no attribute 'cookiejar' – Casper Jan 04 '17 at 15:42
  • @Casper - are you sure all described options are set in your settings.py? Is `scrapy_splash.SplashCookiesMiddleware` added to `DOWNLOADER_MIDDLEWARES`? – Mikhail Korobov Jan 04 '17 at 22:30
  • I've updated the question with the DOWNLOADER_MIDDLEWARES settings variable. – Casper Jan 05 '17 at 10:20
  • The problem is that CrawleraMiddleware doesn't play well with Splash. With this middleware, the request is processed as `scrapy -> crawlera -> splash -> remote website`, while it should be `scrapy -> splash -> crawlera -> remote website`. To make them work together you need to adjust the script - see the Crawlera+Splash docs: https://doc.scrapinghub.com/crawlera.html#using-crawlera-with-splash – Mikhail Korobov Jan 05 '17 at 19:48
  • 1
    I've tried it without Crawlera (through `custom_settings = { 'CRAWLERA_ENABLED': False }` and also by commenting the middleware in settings.py ) but I still get the same error. I went over the configuration instruction again and I noticed I don't have the splash middleware in my settings: `SPIDER_MIDDLEWARES = { 'scrapy_splash.SplashDeduplicateArgsMiddleware': 100, }`. I tried to add it but got an error: `'scrapy_splash' doesn't define any object named 'SplashDeduplicateArgsMiddleware'` – Casper Jan 16 '17 at 09:13
  • what is your scrapy_splash version? SplashDeduplicateArgsMiddleware requires scrapy-splash v0.5+ – Mikhail Korobov Jan 16 '17 at 11:46
  • Ok, so also with this SplashDeduplicateArgsMiddleware I'm not able to access the cookies. I've added the spider and settings files in the question. – Casper Jan 16 '17 at 12:04
  • In your latest code you have CrawleraMiddleware enabled; it doesn't work with the scrapy-splash middlewares - Crawlera integration should be done on the Splash side, not on the Scrapy side. See https://doc.scrapinghub.com/crawlera.html#using-crawlera-with-splash and https://github.com/scrapy-plugins/scrapy-splash/issues/97. – Mikhail Korobov Jan 16 '17 at 12:52
  • That's correct, but I had already tested it with the middleware commented out. I also disable Crawlera in the spider through custom settings, so either way Crawlera is disabled. – Casper Jan 16 '17 at 13:00
  • Sorry, I'm not sure how to debug it. It seems your project has a lot of moving parts (many middlewares, Crawlera, a different scrapy-splash version, cookies accessed using the Set-Cookie header, initially a different endpoint). The setup described in the scrapy-splash docs works for me on the sites I tried when cookies are accessed using `response.cookiejar`. – Mikhail Korobov Jan 16 '17 at 17:28

You are trying to read cookies from the "static" headers sent by the server, but JavaScript code in the page can set cookies too. This is why Splash provides splash:get_cookies(). To access the cookies on the response, use the table returned by the Lua script:

return {
   url = splash:url(),
   headers = last_response.headers,
   http_status = last_response.status,
   cookies = splash:get_cookies(),
   html = splash:html(),
}

Try changing this line

self.logger.debug('Cookies: %s', response.headers.get('Set-Cookie'))

to read the cookies entry of that table instead. With the execute endpoint, scrapy-splash exposes the table returned by the script as response.data:

self.logger.debug('Cookies: %s', response.data['cookies'])
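For example, a minimal sketch of the callback (assuming the Lua script above; splash:get_cookies() returns cookies in HAR format, i.e. dicts with name and value keys):

def parse_page(self, response):
    # response.data is the Python dict decoded from the table
    # returned by the Lua script (a SplashJsonResponse)
    for cookie in response.data['cookies']:
        self.logger.debug('Cookie: %s=%s', cookie['name'], cookie['value'])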
Franz Kurt