
I am trying to use pdfkit to make a visual backup of our company wiki. I am running into trouble because the website requires the user to be logged in. I developed a script using splinter that logs into the company wiki, but when pdfkit executes, the generated PDF shows the login page — pdfkit must be opening a separate session. How can I find the credentials (cookies) needed to access the pages on my site, and save them in a variable so I can grab these screenshots?

I am using Python 2.7.8 with splinter, requests, and pdfkit.

from splinter import Browser
import pdfkit

# Log in through a real browser session
browser = Browser()
browser.visit('https://companywiki.com')
browser.find_by_id('login-link').click()
browser.fill('os_username', 'username')
browser.fill('os_password', 'password')
browser.find_by_name('login').click()

# pdfkit knows nothing about the browser session above, so this
# fetches the login page instead of the wiki page
pdfkit.from_url("https://pagefromcompanywiki.com", "c:/out.pdf")

I also found the following script, which will log me in and save credentials, but I'm not sure how to tie it into what I am trying to do.

import requests

EMAIL = ''
PASSWORD = ''
URL = 'https://company.wiki.com'

def main():
    # A Session keeps cookies across requests
    session = requests.Session()
    login_data = {
        'loginemail': EMAIL,
        'loginpswd': PASSWORD,
        'submit': 'login',
    }
    session.post(URL, data=login_data)  # log in; session stores the cookies
    r = session.get('https://pageoncompanywiki.com')  # authenticated request

if __name__ == '__main__':
    main()

Any ideas on how to accomplish this task are appreciated.

2 Answers


When you log in with your Splinter browser, the site sends back HTTP cookies that identify your authenticated session, and the browser remembers them for further requests.

But PDFKit knows nothing about your browser. It just passes the URL you gave it down to the underlying wkhtmltopdf tool, which then fetches the page with its own default settings.

What you need to do is transfer cookies from browser to wkhtmltopdf. Thankfully, it’s easy to connect Splinter and PDFKit in this way:

# Pass the browser's session cookies to wkhtmltopdf as repeated --cookie options
options = {"cookie": browser.cookies.all().items()}
pdfkit.from_url("https://pagefromcompanywiki.com", "c:/out.pdf", options=options)
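To see what that `options` dictionary turns into, here is a small sketch (the cookie names and values are made up, standing in for what `browser.cookies.all()` would return) of how pdfkit passes each (name, value) pair to wkhtmltopdf as a repeated `--cookie` flag:

```python
# Hypothetical cookies, standing in for browser.cookies.all()
cookies = {"JSESSIONID": "abc123", "seraph.rememberme.cookie": "xyz789"}

options = {"cookie": list(cookies.items())}

# Roughly the command-line flags wkhtmltopdf ends up receiving:
args = []
for name, value in options["cookie"]:
    args += ["--cookie", name, value]
print(args)
```

Because the option value is a list of pairs, any number of cookies can be forwarded in one call.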
Vasiliy Faronov
  • This ended up working. Thank you for the help. However it only screen capped the top of the page. Any idea if there is a way to set parameters, or ensure it screen shots the entire page? –  Aug 03 '17 at 13:48
  • @ChaseRaab Not sure, sorry. Consult the [list of wkhtmltopdf options](https://wkhtmltopdf.org/usage/wkhtmltopdf.txt) — all of them can be passed via pdfkit’s `options` dictionary, as explained in [pdfkit docs](https://pypi.python.org/pypi/pdfkit). If this doesn’t help, try asking a separate question about that. – Vasiliy Faronov Aug 03 '17 at 13:51

You have to handle cookies:

import cookielib

class CookieJar(cookielib.CookieJar):
    def _cookie_from_cookie_tuple(self, tup, request):
        # Strip the quotes some servers put around the Version attribute
        # before handing the tuple to the standard parser
        name, value, standard, rest = tup
        version = standard.get('version', None)
        if version is not None:
            standard["version"] = version.replace('"', '')
        return cookielib.CookieJar._cookie_from_cookie_tuple(self, tup, request)

and you need an opener as well:

import urllib2

def getOpener():
    cj = CookieJar()
    cj.set_policy(cookielib.DefaultCookiePolicy(rfc2965=True))
    cjhdr = urllib2.HTTPCookieProcessor(cj)
    return urllib2.build_opener(cjhdr)

and then you would do something like

urlHandle = getOpener().open(request)
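Putting the pieces together, here is a minimal self-contained sketch. It uses the stock `CookieJar` for brevity (substitute the subclass above if your server sends quoted Version attributes), shows the Python 3 equivalents of the imports as a fallback, and makes no network request:

```python
try:
    import cookielib          # Python 2, as used in this answer
    import urllib2
except ImportError:           # Python 3 equivalents
    import http.cookiejar as cookielib
    import urllib.request as urllib2

def get_opener():
    # Jar with an RFC 2965 policy, as above
    cj = cookielib.CookieJar()
    cj.set_policy(cookielib.DefaultCookiePolicy(rfc2965=True))
    return urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)), cj

opener, jar = get_opener()
# POSTing the login form through `opener` stores the session cookies in
# `jar`; later opener.open(...) calls send them back automatically.
```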
loretoparisi
  • Is there any documentation you can offer me that would help me understand the function of this? –  Aug 03 '17 at 13:10