I have a small internal web page that requires a login. Once logged in, a simple HTML page is loaded, and JavaScript on that page then loads the actual content.

I want to:

  • Log into the page
  • Run the javascript
  • Extract information from the page
  • Find links in the page and repeat the procedure
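
The four steps above amount to a crawl loop, which can be sketched roughly as follows. This is only a minimal sketch using the standard library: `fetch_rendered` is a hypothetical callable (not part of requests_html) that is assumed to return the HTML of a URL after login and JavaScript rendering, and `crawl`, `LinkCollector` and `max_pages` are names made up for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch_rendered, max_pages=50):
    """Breadth-first crawl of one host; returns {url: rendered_html}.

    fetch_rendered(url) is a hypothetical callable that returns the page's
    HTML after login and JavaScript rendering.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in pages:
            continue  # already visited
        html = fetch_rendered(url)
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop any #fragment.
            target = urljoin(url, href).split("#")[0]
            if urlparse(target).netloc == host and target not in pages:
                queue.append(target)
    return pages
```

The information extraction step would go wherever `pages[url]` is filled in; restricting the crawl to the start URL's host is a design choice of this sketch, not a requirement.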

I found a package called requests_html that seems intended for exactly this. I managed to use requests_html to log into the page and fetch the HTML of the page I want. It should then be possible to call

response.html.render()

and requests_html should then use pyppeteer, which downloads and launches a headless Chromium, loads and renders the web page, and returns the result. This actually works, but it only returns the login page. The session information from requests_html is not passed on to pyppeteer and/or Chromium.

Is it possible to use the same session, or do I need to try to log in using only pyppeteer?

Here is a code example; to try it you need a small web page with a form login and JavaScript rendering:

from requests_html import HTMLSession

url = "https://example.com"
username = "user@example.com"
password = "hunter2"

session = HTMLSession()
payload = {
    "input_user": username,
    "input_password": password,
}

# Submit the login form; the session stores the login cookies
response = session.post(url, data=payload)

# Logged in here: this returns the HTML of the page I want
response = session.get(url)

# Render the JavaScript in headless Chromium via pyppeteer
response.html.render()

# Output from this shows the login page instead of the rendered content
print(response.html.html)
MrBerta

2 Answers

You can install the GitHub version of requests-html and pass the following parameter to render():

response.html.render(send_cookies_session=True)

This carries the login cookies from your session over to the Chromium page instance used for rendering, so the rendered page stays logged in.
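
To illustrate the general idea of handing session cookies to the browser, here is a minimal sketch. It is not the code requests-html uses internally; `cookies_for_browser` is a hypothetical helper name. It assumes the cookies live in a standard `http.cookiejar.CookieJar` (the jar used by requests sessions is a subclass of it) and converts them into dicts with the field names pyppeteer's `page.setCookie()` accepts.

```python
from http.cookiejar import Cookie, CookieJar


def cookies_for_browser(jar):
    """Convert the cookies in a CookieJar into dicts for a headless browser.

    Hypothetical helper: the keys follow the cookie fields that
    pyppeteer's page.setCookie() accepts.
    """
    converted = []
    for cookie in jar:
        entry = {
            "name": cookie.name,
            "value": cookie.value,
            "domain": cookie.domain,
            "path": cookie.path,
            "secure": bool(cookie.secure),
        }
        # Session cookies have no expiry; only pass one when it is set.
        if cookie.expires is not None:
            entry["expires"] = cookie.expires
        converted.append(entry)
    return converted
```

When driving pyppeteer directly, the result could be applied with `await page.setCookie(*cookies_for_browser(session.cookies))` before calling `page.goto(url)`.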

virantha
    Thanks for the answer! Can you post a link to the github version that you are talking about as well? – MrBerta Dec 18 '19 at 08:39
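Installing the GitHub version (which I assume is less stable) is not required. You can pass reload=False instead, which according to the documentation renders the content already held in memory rather than loading the page again in the browser: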

response.html.render(reload=False)

Just saw that this is from 2019... I guess better late than never and yes that is what she said ;-)

Andres R