I'm trying to use scrapy to scrape a site that uses javascript extensively to manipulate the document, cookies, etc (but nothing simple like JSON responses). For some reason I can't determine from the network traffic, the page I need comes up as an error when I scrape but not when viewed in the browser. So what I want to do is use webkit to render the page as it appears in the browser, and then scrape this. The scrapyjs project was made for this purpose.

To access the page I need, I must first log in and save some session cookies. My problem is that I cannot successfully provide the session cookies to webkit when it renders the page. I could think of two ways to do this:

  1. use scrapy page requests exclusively until I get to the page that needs webkit, and then pass along the requisite cookies.
  2. use webkit within scrapy (via a modified version of scrapyjs), for the entire session from login until I get to the page I need, and allow it to preserve cookies as needed.

Unfortunately neither approach seems to be working.

Along the lines of approach 1, I tried the following: In settings.py --

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.middleware.WebkitDownloader': 701, #to run after CookiesMiddleware
}

I modified scrapyjs to send cookies: scrapyjs/middleware.py--

import gtk
import webkit
import jswebkit
#from gi.repository import Soup  # conflicting static and dynamic includes!?
import ctypes
from urlparse import urlparse  # needed for urlparse(request.url) below
libsoup = ctypes.CDLL('/usr/lib/i386-linux-gnu/libsoup-2.4.so.1')
libwebkit = ctypes.CDLL('/usr/lib/libwebkitgtk-1.0.so.0')

def process_request( self, request, spider ):
    if 'renderjs' in request.meta:
        cookies = request.headers.getlist('Cookie')
        if len(cookies)>0:
            cookies = cookies[0].split('; ')
            cookiejar = libsoup.soup_cookie_jar_new()
            libsoup.soup_cookie_jar_set_accept_policy(cookiejar,0) #0==ALWAYS ACCEPT
            up = urlparse(request.url)
            for c in cookies:
                sp = c.find('=')  # find FIRST = as split position
                cookiename = c[:sp]
                cookieval = c[sp+1:]
                libsoup.soup_cookie_jar_add_cookie(cookiejar, libsoup.soup_cookie_new(cookiename,cookieval,up.hostname,'None',-1))
            # register the jar with the default session once, after all cookies are added
            session = libwebkit.webkit_get_default_session()
            libsoup.soup_session_add_feature(session,cookiejar)

        webview = self._get_webview()
        webview.connect('load-finished', self.stop_gtk)
        webview.load_uri(request.url)
        ...

The code for setting the cookiejar is adapted from this response. The problem may be with how imports work; perhaps this is not the right webkit that I'm modifying -- I'm not too familiar with webkit and the python documentation is poor. (I can't use the second answer's approach with from gi.repository import Soup because it mixes static and dynamic libraries. I also can't find any get_default_session() in webkit as imported above).
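The Cookie-header splitting done in the middleware above can be pulled into a small helper that can be checked without webkit at all. This is only a sketch: `parse_cookie_header` is a hypothetical name, not part of scrapy or scrapyjs, and it assumes the header is a plain `name=value; name=value` string.

```python
def parse_cookie_header(header):
    """Split a Cookie request header into (name, value) pairs."""
    if isinstance(header, bytes):
        # scrapy may hand back raw bytes; decode before splitting
        header = header.decode("latin-1")
    pairs = []
    for part in header.split(";"):
        part = part.strip()
        if not part:
            continue
        # partition splits on the FIRST '=', so values may contain '='
        name, sep, value = part.partition("=")
        if not sep:
            continue  # skip fragments that are not name=value
        pairs.append((name, value))
    return pairs
```

The resulting pairs can then be fed one at a time to `soup_cookie_new`, as in the loop above.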

The second approach fails because sessions aren't preserved across requests, and again I don't know enough about webkit to know how to make it persist in this framework.

Any help appreciated!

jtbr

2 Answers

Actually, the first approach does work, with one modification: the cookie path needs to be '/' (at least in my application), not 'None' as in the code above. That is, the line should be

libsoup.soup_cookie_jar_add_cookie(cookiejar, libsoup.soup_cookie_new(cookiename,cookieval,up.hostname,'/',-1))

Unfortunately this only pushes the question back a bit. The cookies are now saved properly, but the full page (including the frames) is still not loaded and rendered by webkit as I had expected, so the DOM is not complete as I see it in the browser. If I simply request the frame that I want, I get the error page instead of the content shown in a real browser. I'd love to see how to use webkit to render the whole page, including frames, or how to achieve the second approach and complete the entire session in webkit.

jtbr

Without knowing the complete workflow of the application, I can only say that you need to make sure the cookie jar is set before webkit performs any other network activity. See http://webkitgtk.org/reference/webkitgtk/unstable/webkitgtk-Global-functions.html#webkit-get-default-session. In my experience, this means in practice that it must happen even before instantiating the web view.
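As a rough sketch of that ordering, using the ctypes handles from the question: the function names `configure_signatures` and `install_cookies` are made up for illustration, the path default of `/` and the `-1` max age are assumptions carried over from the question and the other answer, and the whole thing is untested against a real webkit build.

```python
import ctypes


def configure_signatures(libsoup, libwebkit):
    """Declare pointer types so 64-bit pointers are not truncated to int."""
    libsoup.soup_cookie_jar_new.restype = ctypes.c_void_p
    libsoup.soup_cookie_new.restype = ctypes.c_void_p
    libsoup.soup_cookie_new.argtypes = [
        ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
        ctypes.c_char_p, ctypes.c_int,
    ]
    libsoup.soup_cookie_jar_add_cookie.argtypes = [
        ctypes.c_void_p, ctypes.c_void_p]
    libwebkit.webkit_get_default_session.restype = ctypes.c_void_p
    libsoup.soup_session_add_feature.argtypes = [
        ctypes.c_void_p, ctypes.c_void_p]


def install_cookies(libsoup, libwebkit, pairs, hostname, path="/"):
    """Attach a populated cookie jar to webkit's default session.

    Meant to be called before the web view is created, so that no
    request goes out without the cookies.
    """
    jar = libsoup.soup_cookie_jar_new()
    for name, value in pairs:
        cookie = libsoup.soup_cookie_new(
            name.encode("utf-8"), value.encode("utf-8"),
            hostname.encode("utf-8"), path.encode("utf-8"),
            -1,  # max_age of -1: cookie lasts for the session
        )
        libsoup.soup_cookie_jar_add_cookie(jar, cookie)
    session = libwebkit.webkit_get_default_session()
    libsoup.soup_session_add_feature(session, jar)
    return jar
```

In the middleware from the question, both calls would go before `self._get_webview()`, with `libsoup` and `libwebkit` being the `ctypes.CDLL` objects loaded there.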

Another thing to check is whether your frames are served from the same domain. Cookie policies will not allow cookies to be sent across different domains.
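A quick way to check this for a given frame URL is a simplified domain match, loosely in the spirit of the usual cookie rules. This is only a sketch: `cookie_domain_matches` is a hypothetical helper, and it ignores details such as host-only cookies, paths, and the secure flag.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def cookie_domain_matches(cookie_domain, url):
    """Return True if a cookie set for cookie_domain would apply to url's host."""
    host = (urlparse(url).hostname or "").lower()
    domain = cookie_domain.lower().lstrip(".")
    if not host or not domain:
        return False
    # exact host, or a subdomain of the cookie's domain
    return host == domain or host.endswith("." + domain)
```

If this returns False for the frame's URL, the cookies added for the top-level host will not be sent with the frame request.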

Lastly, you can probably inject the cookies yourself. See http://webkitgtk.org/reference/webkitgtk/unstable/webkitgtk-webkitwebview.html#WebKitWebView-navigation-policy-decision-requested or the resource-request-starting signal, and then set the cookies on the actual soup message.

user871199