
When the JavaScript is loaded, it makes another AJAX request, and cookies should be set in the response. However, Splash does not keep any cookies across multiple requests. Is there a way to keep the cookies across all requests, or even to assign them manually between requests?

Gallaecio
James Samovar

1 Answer


Yes, there is an example in the scrapy-splash README; see the Session Handling section. In short: first make sure that all scrapy-splash settings are correct, then use SplashRequest(url, endpoint='execute', args={'lua_source': script}) to send your Scrapy requests. The rendering script should look like this:

function main(splash)
    splash:init_cookies(splash.args.cookies)

    -- ... your script

    return {
        cookies = splash:get_cookies(),
        -- ... other results, e.g. html
    }
end
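
For the "assign them manually between requests" part of the question, the round trip can be sketched with nothing but the standard library. This is a minimal sketch, not the scrapy-splash implementation: it only builds the JSON body you would POST to Splash's `execute` endpoint and reads the cookies back out of the script's result. The helper names `build_execute_body` and `cookies_from_result` are hypothetical, and the `splash:go` / `splash:html` lines filling in "your script" are an assumption about what the script does.

```python
import json

# Same shape as the script above; the body between init_cookies and the
# return statement is an assumed placeholder (load the page, return HTML).
LUA_SOURCE = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go(splash.args.url))
    return {
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
"""


def build_execute_body(url, cookies):
    """Return the JSON body for a POST to Splash's `execute` endpoint.

    Every key in the body becomes available to the script as splash.args.<key>,
    so `cookies` is what splash:init_cookies() receives.
    """
    return json.dumps({"lua_source": LUA_SOURCE, "url": url, "cookies": cookies})


def cookies_from_result(result_text):
    """Extract the cookie list that the Lua script returned as JSON."""
    return json.loads(result_text).get("cookies", [])


# First request starts with an empty jar.
first_body = build_execute_body("http://example.com/", [])

# Stand-in for the text of Splash's response to the first request.
fake_result = json.dumps({
    "cookies": [{"name": "sid", "value": "abc", "domain": "example.com", "path": "/"}],
    "html": "<html></html>",
})

# Carry the returned jar into the next request.
jar = cookies_from_result(fake_result)
second_body = build_execute_body("http://example.com/next", jar)
```

With scrapy-splash configured, its cookie middleware is meant to do this carrying-over for you, so a manual loop like this should only be needed when talking to Splash directly.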

There is also a complete example with cookie handling, header handling, etc. in the scrapy-splash README; see the last example there.
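
As for "make sure that all settings are correct", the fragment below is roughly what the scrapy-splash README asks for in `settings.py`, reproduced from memory, so check it against the README of the version you install. The `SPLASH_URL` value is an assumption (a Splash instance on localhost, port 8050).

```python
# settings.py (fragment)

# Address of the running Splash instance; adjust to your setup (assumed value).
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    # Stores cookies returned by the Lua script and passes them back
    # as splash.args.cookies on later requests.
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Makes duplicate filtering aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

The cookie middleware is the piece relevant to this question: without it, the `cookies` key returned by the script is not fed back into subsequent requests.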

Mikhail Korobov
  • Thanks for the help, Mikhail. What happens when I need to set cookies for calls made in the JavaScript? Four different requests happen when I do `splash:go(url)`, and I would like to set cookies after the second request. – James Samovar Nov 11 '16 at 20:04
  • Sorry, I don't quite understand the question. Cookies received in AJAX responses should be merged to Splash cookiejar and returned in splash:get_cookies(). splash:init_cookies() sets content of a browser cookiejar, browser should use these cookies for all requests, including AJAX requests. So the script above should work regardless of how many requests you're making in your Lua script. – Mikhail Korobov Nov 11 '16 at 20:52
  • Oh I understand now, so I guess the problem is not with the cookies. I'm basically trying to access Crunchbase.com through Splash, they have some weird bot protection. Accessing from a browser always works. Do you have any idea of how to make Splash's behavior exactly like a browser's? – James Samovar Nov 11 '16 at 21:08
  • Splash works like a browser, but a rather old one; it uses almost the same rendering engine as PhantomJS 2.0 - it is WebKit from 2013. It is possible to detect this engine using e.g. engine-specific bugs and gotchas, or using its missing features. It also sets user agent which can be identified (you can set your own though). – Mikhail Korobov Nov 11 '16 at 21:16
  • I see, appreciate your help! – James Samovar Nov 11 '16 at 22:06