0

I want to scrape data from a website that shows an initial login prompt (I have working credentials). I cannot inspect the code for it, as it is a login dialog that pops up before the site loads. I tried searching around but did not find an answer; perhaps I do not know what to search for.

This is what you get when going to the site:

Log on

Any help is appreciated :-)

user469216
  • It would be helpful to know which screen scraping library you're using before providing an answer. [this thread](https://stackoverflow.com/questions/23102833/how-to-scrape-a-website-which-requires-login-using-python-and-beautifulsoup) and [this thread](https://stackoverflow.com/questions/13925983/login-to-website-using-urllib2-python-2-7) may be able to help – Dillanm Jun 25 '18 at 09:02

3 Answers

0

The solution is to use the public REST API for the site.

If the web site does not provide a REST API for interacting with it you should not be surprised that your attempt at simulating a human is difficult. Web scraping is generally only possible for pages that do not require authentication or utilize the standard HTTP 401 status response to tell the client that it should prompt the user to respond with the correct credentials. If the site is using a different mechanism, most likely based on AJAX, then the solution is going to be specific to that web site or other sites using the same mechanism. Which means that no one can answer your question since you did not tell us which web site you are interacting with.
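
To make the "standard HTTP 401" case concrete, here is a minimal sketch that uses only the Python standard library. It assumes the server answers an unauthenticated request with a 401 status and a `WWW-Authenticate` header naming either the Basic or the Digest scheme. The function names (`auth_scheme`, `probe`, `build_opener_for`) and the example URL in the usage note are made up for illustration and are not from the question.

```python
import urllib.error
import urllib.request


def auth_scheme(www_authenticate):
    """Return the lowercase scheme name from a WWW-Authenticate header value."""
    parts = (www_authenticate or "").split(None, 1)
    return parts[0].lower() if parts else None


def probe(url):
    """Request `url` without credentials and report the scheme the server asks for."""
    try:
        urllib.request.urlopen(url)
    except urllib.error.HTTPError as err:
        if err.code == 401:
            # Simplification: only the first WWW-Authenticate header is read.
            return auth_scheme(err.headers.get("WWW-Authenticate"))
        raise
    return None  # No 401 challenge was issued.


def build_opener_for(url, user, password, scheme):
    """Build an opener that answers a 401 challenge using the given scheme."""
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, url, user, password)
    if scheme == "basic":
        handler = urllib.request.HTTPBasicAuthHandler(manager)
    elif scheme == "digest":
        handler = urllib.request.HTTPDigestAuthHandler(manager)
    else:
        raise ValueError(f"unsupported auth scheme: {scheme!r}")
    return urllib.request.build_opener(handler)
```

A possible use would be `scheme = probe("https://example.com/")` followed by `build_opener_for("https://example.com/", "user", "pass", scheme).open("https://example.com/")`. If `probe` returns `None`, the site is probably not using this mechanism and the login has to be handled some other way.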

Kurtis Rader
  • Open the Network tab in Chrome, fill in your credentials, and look at the POST request the site makes when you log in. Copy all the parameters and send them as a payload using requests. Usually not that big a deal. – jlaur Jun 26 '18 at 18:25
  • @jlaur That isn't a general purpose solution. It assumes the cookies set by the site after you authenticate are valid forever. Which might work if the people who created the site don't know what they are doing or don't really care about security. Otherwise it isn't a viable approach. – Kurtis Rader Jun 27 '18 at 03:49
  • And that would be where requests.Session() enters. I'm merely trying to point the dude in the right direction for starters... – jlaur Jun 27 '18 at 04:56
  • Or selenium for that matter. – jlaur Jun 27 '18 at 05:06
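
The approach discussed in these comments (replay the login POST, then reuse the same session so its cookies carry over) could look roughly like the sketch below. It assumes `session` is a `requests.Session()` or any object with the same `post`/`get` interface. The helper names, the login URL, and the form field names in the usage note are hypothetical and would have to be copied from what the browser's Network tab shows for the real site.

```python
def log_in(session, login_url, form_fields):
    """POST the login form through `session` so the cookies it receives are kept."""
    response = session.post(login_url, data=form_fields)
    response.raise_for_status()  # Fail loudly on 4xx/5xx instead of scraping an error page.
    return response


def fetch_page(session, page_url):
    """GET a page with the same session, reusing any cookies set at login."""
    response = session.get(page_url)
    response.raise_for_status()
    return response.text
```

With requests installed, usage might be `s = requests.Session()`, then `log_in(s, "https://example.com/login", {"username": "user", "password": "pass"})`, then `fetch_page(s, "https://example.com/data")`. If the site expires its cookies, calling `log_in` again on the same session before retrying is one way to handle it.
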
0

Based on your screenshot this looks like it is just using Basic Auth.

Using the library "requests":

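import requests

url = 'https://example.com/'  # placeholder: replace with the site's address
session = requests.Session()
# Basic Auth, as described above; use requests.auth.HTTPDigestAuth if the server asks for Digest.
r = session.get(url, auth=requests.auth.HTTPBasicAuth('user', 'pass'))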

Should get you there.

0

I couldn't get Tom's answer to work, but I found a workaround:

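from selenium import webdriver

# Selenium 3 style: the first argument is the path to the chromedriver binary.
driver = webdriver.Chrome('path to chromedriver')
# Credentials embedded in the URL answer the browser's login prompt.
driver.get('https://user:password@webaddress.com/')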

This worked :)

user469216