I am trying to access the session cookies within a spider. I first log in to a social network from the spider:

    def parse(self, response):

        return [FormRequest.from_response(response,
                formname='login_form',
                formdata={'email': '...', 'pass':'...'},
                callback=self.after_login)]

In after_login, I would like to access the session cookies, in order to pass them to another module (Selenium here) to further process the page with an authenticated session.

I would like something like this:

    def after_login(self, response):

        # process response
        .....

        # Access the cookies of this session to request another URL in the
        # same domain with the authenticated session. Something like:
        session_cookies = XXX.get_session_cookies()
        data = another_function(url, session_cookies)

Unfortunately, response.cookies does not return the session cookies.

How can I get the session cookies? I was looking at the cookies middleware (scrapy.contrib.downloadermiddleware.cookies) and scrapy.http.cookies, but there doesn't seem to be any straightforward way to access the session cookies.

Some more details about my original question:

Unfortunately, I used your idea but I didn't see the cookies, although I know for sure that they exist, since the scrapy.contrib.downloadermiddleware.cookies middleware does print out the cookies! These are exactly the cookies that I want to grab.

So here is what I am doing:

The after_login(self, response) method receives the response variable after proper authentication, and then I access a URL with the session data:

    def after_login(self, response):

        # testing to see if I can get the session cookies
        cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
        cookieJar.extract_cookies(response, response.request)
        cookies_test = cookieJar._cookies
        print "cookies - test:", cookies_test

        # URL access with authenticated session
        url = "http://site.org/?id=XXXX"
        request = Request(url=url, callback=self.get_pict)
        return [request]

As the output below shows, there are indeed cookies, but I fail to capture them with cookieJar:

    cookies - test: {}
    2012-01-02 22:44:39-0800 [myspider] DEBUG: Sending cookies to: <GET http://www.facebook.com/profile.php?id=529907453>
        Cookie: xxx=3..........; yyy=34.............; zzz=.................; uuu=44..........

So I would like to get a dictionary containing the keys xxx, yyy etc with the corresponding values.
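A raw Cookie header line like the one in the debug output can be turned into exactly that dictionary with the stdlib cookie parser; a minimal sketch (Python 3; the header string is a made-up example in the same shape as the log line above):

```python
# Sketch: turn a raw "Cookie:" header string (as printed by the
# middleware debug output) into a {name: value} dict using the stdlib.
# The header string below is a made-up example.
from http.cookies import SimpleCookie

def cookie_header_to_dict(header_value):
    parsed = SimpleCookie()
    parsed.load(header_value)
    return {name: morsel.value for name, morsel in parsed.items()}

cookies = cookie_header_to_dict("xxx=3abc; yyy=34def; zzz=ghi; uuu=44jkl")
print(cookies)  # {'xxx': '3abc', 'yyy': '34def', 'zzz': 'ghi', 'uuu': '44jkl'}
```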

Thanks :)

mikolune
  • Do I understand correctly that you want to authenticate on facebook, but scrape data from a different domain while being authenticated on facebook? – warvariuc Jan 03 '12 at 07:25

4 Answers

A classic example is having a login server, which provides a new session id after a successful login. This new session id should be used with another request.

Here is the code, picked up from source, which seems to work for me:

    def check_logged(self, response):
        tmpCookie = response.headers.getlist('Set-Cookie')[0].split(";")[0].split("=")[1]
        print 'cookie from login', tmpCookie
        cookieHolder = dict(SESSION_ID=tmpCookie)

        #print response.body
        if "my name" in response.body:
            yield Request(url="<<new url for another server>>",
                          cookies=cookieHolder,
                          callback=self.<<another function here>>)
        else:
            print "login failed"
            return
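The chained split calls above work for simple headers but are fragile; a sketch of the same extraction using the stdlib parser (Python 3; the cookie name SESSION_ID and the header value are made-up examples), which also copes with attributes like Path and HttpOnly:

```python
# Sketch: a safer way to pull the session id out of a Set-Cookie header,
# using the stdlib cookie parser instead of chained str.split calls.
# The cookie name "SESSION_ID" and the header value are made-up examples.
from http.cookies import SimpleCookie

def session_id_from_set_cookie(set_cookie_value, name="SESSION_ID"):
    parsed = SimpleCookie()
    parsed.load(set_cookie_value)
    return parsed[name].value if name in parsed else None

print(session_id_from_set_cookie("SESSION_ID=abc123; Path=/; HttpOnly"))  # abc123
```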
Ravi Ramadoss

Maybe this is overkill, but I don't know how you are going to use those cookies, so it might be useful (an excerpt from real code - adapt it to your case):

from scrapy.http.cookies import CookieJar

class MySpider(BaseSpider):

    def parse(self, response):

        cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
        cookieJar.extract_cookies(response, response.request)
        request = Request(nextPageLink, callback = self.parse2,
                      meta = {'dont_merge_cookies': True, 'cookie_jar': cookieJar})
        cookieJar.add_cookie_header(request) # apply Set-Cookie ourselves

CookieJar has some useful methods.
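The extract_cookies / add_cookie_header round trip can be tried out against the plain stdlib http.cookiejar.CookieJar, which is what scrapy.http.cookies.CookieJar wraps; a minimal, self-contained sketch (Python 3) with a stubbed response object, where FakeResponse, the URL, and the cookie value are made-up stand-ins for real Scrapy objects:

```python
# Sketch (Python 3 stdlib): the extract_cookies / add_cookie_header
# round trip on a plain http.cookiejar.CookieJar, which is what
# scrapy.http.cookies.CookieJar wraps internally.
import urllib.request
from email.message import Message
from http.cookiejar import CookieJar

class FakeResponse:
    """Just enough of a response object for CookieJar.extract_cookies()."""
    def __init__(self, url, set_cookie):
        self._url = url
        self._headers = Message()
        self._headers["Set-Cookie"] = set_cookie

    def info(self):
        return self._headers

    def geturl(self):
        return self._url

jar = CookieJar()
login_request = urllib.request.Request("http://site.org/")
login_response = FakeResponse("http://site.org/", "sessionid=abc123; Path=/")
jar.extract_cookies(login_response, login_request)  # store the session cookie

# A later request to the same domain gets the cookie re-attached.
next_request = urllib.request.Request("http://site.org/?id=XXXX")
jar.add_cookie_header(next_request)
print(next_request.get_header("Cookie"))  # sessionid=abc123
```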

If you still don't see the cookies - maybe they are not there?


UPDATE:

Looking at CookiesMiddleware code:

class CookiesMiddleware(object):
    def _debug_cookie(self, request, spider):
        if self.debug:
            cl = request.headers.getlist('Cookie')
            if cl:
                msg = "Sending cookies to: %s" % request + os.linesep
                msg += os.linesep.join("Cookie: %s" % c for c in cl)
                log.msg(msg, spider=spider, level=log.DEBUG)

So, try request.headers.getlist('Cookie')

warvariuc
  • Many thanks for your answer! Unfortunately it still didn't work. However, I know for sure there are cookies. See post below for what I did. – mikolune Jan 03 '12 at 06:51
  • Please see my edit to the original post to see my response! Many thanks :) – mikolune Jan 03 '12 at 07:07
  • @mikolune, see the update. Also, learn to look into source code - that's why Python is good - you can look into source code, which is sometimes the best documentation. – warvariuc Jan 03 '12 at 07:32
  • Many thanks warvariuc. For my particular problem, I found a way around to not have to access the cookies (it had additional benefits as well). But this seems to be the solution. I will try anyway in a few days and let you know here how it went. – mikolune Jan 07 '12 at 01:31

This works for me:

    response.request.headers.get('Cookie')

It seems to return all the cookies that were introduced by the middleware in the request, session cookies or otherwise.

Way Too Simple

As of 2021 (Scrapy 2.5.1), this is still not particularly straightforward. But you can access downloader middlewares (like CookiesMiddleware) from within a spider via self.crawler.engine.downloader:

    def after_login(self, response):
        downloader_middlewares = self.crawler.engine.downloader.middleware.middlewares
        cookies_mw = next(mw for mw in downloader_middlewares
                          if isinstance(mw, CookiesMiddleware))
        jar = cookies_mw.jars[response.meta.get('cookiejar')].jar

        cookies_list = [vars(cookie)
                        for domain in jar._cookies.values()
                        for path in domain.values()
                        for cookie in path.values()]
        # or
        cookies_dict = {cookie.name: cookie.value
                        for domain in jar._cookies.values()
                        for path in domain.values()
                        for cookie in path.values()}

        ...

Both output formats above can be passed to other requests using the cookies parameter.
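The nested comprehensions above can be sanity-checked against a plain stdlib http.cookiejar.CookieJar, which has the same domain/path/name nesting as the jar returned by the .jar attribute; a minimal sketch (Python 3; the inserted cookie is a made-up example):

```python
# Sketch: the same {name: value} flattening as above, demonstrated on a
# stdlib http.cookiejar.CookieJar. The cookie inserted here is a
# made-up example; in Scrapy the jar is already populated.
from http.cookiejar import Cookie, CookieJar

jar = CookieJar()
jar.set_cookie(Cookie(
    version=0, name="sessionid", value="abc123",
    port=None, port_specified=False,
    domain="site.org", domain_specified=True, domain_initial_dot=False,
    path="/", path_specified=True, secure=False, expires=None,
    discard=True, comment=None, comment_url=None, rest={},
))

# jar._cookies is nested as {domain: {path: {name: Cookie}}}
cookies_dict = {cookie.name: cookie.value
                for domain in jar._cookies.values()
                for path in domain.values()
                for cookie in path.values()}
print(cookies_dict)  # {'sessionid': 'abc123'}
```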

Ivan Lonel