I have a small internal webpage that requires a login. When logged in, a simple HTML page is loaded, and JavaScript then loads the actual content of the pages.
I want to:
- Log into the page
- Run the JavaScript
- Extract information from the page
- Find links in the page and repeat the procedure (see the sketch after this list)
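For the extraction and link-following steps, the flow I am aiming for looks roughly like this. It is an untested sketch that ignores the login/session problem described below; the crawl function and the .content selector are placeholders, not real names from my page:

from requests_html import HTMLSession

session = HTMLSession()

def crawl(url, seen=None):
    # Remember visited URLs so the recursion terminates
    if seen is None:
        seen = set()
    if url in seen:
        return
    seen.add(url)

    response = session.get(url)
    response.html.render()  # run the JavaScript that loads the content

    # Extract information from the rendered page (placeholder selector)
    for element in response.html.find(".content"):
        print(element.text)

    # Find links in the page and repeat the procedure
    for link in response.html.absolute_links:
        crawl(link, seen)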
I found a package called requests_html whose goal sounds like exactly this. I managed to use requests_html to log into the page and get the HTML view of the page I want. It should then be possible to call
response.html.render()
and requests_html should then use pyppeteer, which downloads and launches a headless Chromium, loads the webpage, renders it, and returns the result. This actually works, but it only returns the login page: the session information from requests_html is not passed on to pyppeteer and/or Chromium.
Is it possible to use the same session, or do I need to log in from scratch using only pyppeteer?
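For the first option, what I imagine "using the same session" would look like is driving pyppeteer directly and copying the cookies over from the requests session. This is an untested sketch of that idea; the cookie fields are my guess at what pyppeteer's setCookie expects:

import asyncio
from pyppeteer import launch

async def render_with_session(url, session):
    # Launch a headless Chromium, like requests_html does internally
    browser = await launch()
    page = await browser.newPage()
    # Copy every cookie from the requests session into the browser,
    # so that Chromium is "logged in" too
    for cookie in session.cookies:
        await page.setCookie({
            "name": cookie.name,
            "value": cookie.value,
            "domain": cookie.domain,
        })
    await page.goto(url)
    content = await page.content()
    await browser.close()
    return content

# html_source = asyncio.get_event_loop().run_until_complete(
#     render_with_session(url, session))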
Here is my current code; note that you need a small webpage with a form login and JavaScript rendering to try it on:
from requests_html import HTMLSession

url = "https://example.com"
username = "user@example.com"
password = "hunter2"

session = HTMLSession()

# Field names match the login form's input elements
payload = {
    "input_user": username,
    "input_password": password,
}

# Log in; the session stores the authentication cookies
response = session.post(url, data=payload)
# Logged in here

# Fetch the page and render its JavaScript with pyppeteer/Chromium
response = session.get(url)
response.html.render()

# Output from this shows the login page, not the rendered content
print(response.html.html)
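For what it is worth, the login itself does work on the requests side; inspecting the session's cookie jar after the POST (plain requests API) shows the authentication cookies that apparently never make it into Chromium:

# Cookies held by the requests session after logging in; these are
# the cookies that are not passed on to the headless Chromium
print(session.cookies.get_dict())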