I'm trying to use scrapy to scrape a site that uses JavaScript extensively to manipulate the document, cookies, etc. (nothing as simple as JSON responses). For some reason I can't determine from the network traffic, the page I need comes back as an error when I scrape it, but not when it's viewed in the browser. So what I want to do is use webkit to render the page as it appears in the browser, and then scrape that. The scrapyjs project was made for this purpose.
To access the page I need, I have to log in first and save some session cookies. My problem is that I cannot successfully provide these session cookies to webkit when it renders the page. There are two ways I can think of to do this:
- use scrapy page requests exclusively until I get to the page that needs webkit, and then pass along the requisite cookies.
- use webkit within scrapy (via a modified version of scrapyjs), for the entire session from login until I get to the page I need, and allow it to preserve cookies as needed.
Unfortunately neither approach seems to be working.
Along the lines of approach 1, I tried the following. In settings.py --

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.middleware.WebkitDownloader': 701,  # run after CookiesMiddleware (700)
}
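For completeness, the spider flags the requests that need rendering through request.meta, which is what the middleware checks for below (the URL and callback names here are just placeholders):

from scrapy.http import Request

def after_login(self, response):
    # ask WebkitDownloader to render this page in webkit
    yield Request('http://example.com/needs-js',
                  meta={'renderjs': True},
                  callback=self.parse_rendered)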
I modified scrapyjs to send cookies: scrapyjs/middleware.py--
import gtk
import webkit
import jswebkit
import ctypes
from urlparse import urlparse
# from gi.repository import Soup  # conflicting static and dynamic includes!?

libsoup = ctypes.CDLL('/usr/lib/i386-linux-gnu/libsoup-2.4.so.1')
libwebkit = ctypes.CDLL('/usr/lib/libwebkitgtk-1.0.so.0')

# these functions return pointers; ctypes assumes int unless told otherwise
libsoup.soup_cookie_jar_new.restype = ctypes.c_void_p
libsoup.soup_cookie_new.restype = ctypes.c_void_p
libwebkit.webkit_get_default_session.restype = ctypes.c_void_p

def process_request(self, request, spider):
    if 'renderjs' in request.meta:
        # build a cookie jar from the cookies CookiesMiddleware attached
        cookiejar = libsoup.soup_cookie_jar_new()
        libsoup.soup_cookie_jar_set_accept_policy(cookiejar, 0)  # 0 == ALWAYS ACCEPT
        up = urlparse(request.url)
        cookies = request.headers.getlist('Cookie')
        if len(cookies) > 0:
            for c in cookies[0].split('; '):
                sp = c.find('=')  # split at the FIRST '=' only
                cookiename, cookieval = c[:sp], c[sp + 1:]
                # path '/' so the cookie matches every URL on the host
                libsoup.soup_cookie_jar_add_cookie(
                    cookiejar,
                    libsoup.soup_cookie_new(cookiename, cookieval, up.hostname, '/', -1))
        # attach the jar to the session webkit uses for its own requests
        session = libwebkit.webkit_get_default_session()
        libsoup.soup_session_add_feature(session, cookiejar)
        webview = self._get_webview()
        webview.connect('load-finished', self.stop_gtk)
        webview.load_uri(request.url)
...
The code for setting the cookie jar is adapted from this response. The problem may be with how the imports work; perhaps the webkit I'm modifying here is not the one that actually renders the page -- I'm not too familiar with webkit, and the Python documentation for it is poor. (I can't use the second answer's approach with from gi.repository import Soup because it mixes static and dynamic bindings. I also can't find any get_default_session() in webkit as imported above.)
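In case it clarifies what I'm after, here is roughly what I expected the pure gi.repository version to look like -- untested, since the import conflict stops me before I get this far, and I'm only assuming these bindings mirror the C API:

from gi.repository import WebKit, Soup

# dynamic bindings only -- no 'import webkit' or 'import gtk' alongside these
session = WebKit.get_default_session()
cookiejar = Soup.CookieJar()
cookiejar.set_accept_policy(Soup.CookieJarAcceptPolicy.ALWAYS)
session.add_feature(cookiejar)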
The second approach fails because sessions aren't preserved across requests, and again I don't know enough about webkit to know how to make it persist in this framework.
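What I imagine is needed is something like the following -- create the cookie jar once when the middleware is constructed and attach it to the shared default session, so every page load in the process reuses it -- but I don't know whether this is the right way to persist state in this framework:

class WebkitDownloader(object):
    def __init__(self):
        # one jar for the whole process; every load_uri() through the
        # default session should then share cookies (untested guess)
        self.cookiejar = libsoup.soup_cookie_jar_new()
        session = libwebkit.webkit_get_default_session()
        libsoup.soup_session_add_feature(session, self.cookiejar)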
Any help appreciated!