
I'm doing web scraping for the first time, trying to grab and compile a list of completed Katas from my CodeWars profile. You can view the completed problems without being logged in, but the page does not display your solutions unless you are logged in to that specific account.

Here is an inspect preview of the page as displayed when logged in, showing the relevant divs I'm trying to scrape: [screenshot of the element inspector]

The URL for that page is https://www.codewars.com/users/User_Name/completed_solutions, with User_Name replaced by an actual username. The login page is: https://www.codewars.com/users/sign_in

I have attempted to get the divs with the class "list-item solutions" in two different ways, shown below:

#attempt 1
import requests
from bs4 import BeautifulSoup

login_url = "https://www.codewars.com/users/sign_in"
end_url = "https://www.codewars.com/users/Ash-Ozen/completed_solutions"

with requests.Session() as sesh:
    result = sesh.get(login_url)

    soup = BeautifulSoup(result.content, "html.parser")

    token = soup.find("input", {"name": "authenticity_token"})["value"]
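    # (I believe this hidden authenticity_token input is the CSRF token the
    # sign-in form expects to get POSTed back along with the credentials)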

    payload = {
        "user[email]": "ph@gmail.com",
        "user[password]": "phpass>",
        "authenticity_token": str(token),
    }

    result = sesh.post(login_url, data=payload) #this logs me in?
    page = sesh.get(end_url) #This navigates me to the target page?

    soup = BeautifulSoup(page.content, "html.parser")
    print(soup.prettify()) # some debugging
    # Examining the print statement shows that the "list-item solutions" is not
    # there. Checking page.url shows the correct URL (https://www.codewars.com/users/Ash-Ozen/completed_solutions).

    solutions = soup.find_all("div", class_="list-item solutions")
    # solutions yields an empty list.

and

#attempt 2
from robobrowser import RoboBrowser
from bs4 import BeautifulSoup

browser = RoboBrowser(history=True)
browser.open("https://www.codewars.com/users/sign_in")
form = browser.get_form()
form["user[email]"].value = "phmail@gmail.com"
form["user[password]"].value = "phpass"
browser.submit_form(form)  # think RoboBrowser handles the CSRF token for me?
browser.open("https://www.codewars.com/users/Ash-Ozen/completed_solutions")
r = browser.parsed()
soup = BeautifulSoup(str(r[0]), "html.parser")
solutions = soup.find_all("div", class_="list-item solutions")  
print(solutions)  # returns empty list 

No idea how/what to debug from here to get it working.
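
The only sanity check I've come up with so far is to look at the response from the login POST in attempt 1 (rough sketch; the "sign_out" marker is just my guess at something that would only show up when logged in):

# quick check right after result = sesh.post(login_url, data=payload) in attempt 1
print(result.status_code)           # probably 200 either way, so not conclusive on its own
print(result.url)                   # did the POST end up on the dashboard, or back on /users/sign_in?
print("sign_out" in result.text)    # guess: a logged-in page should contain a sign-out link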

Edit: My initial thought about what is going wrong: after performing either POST I get redirected to the dashboard (the expected behavior after logging in successfully), but when I then request the final URL I seem to end up with the non-logged-in version of the page.
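
To test that suspicion I'm considering adding something like this to attempt 1 (just a sketch): fetch the target page once through the logged-in session and once with a plain anonymous request, and compare what comes back.

# sketch: run inside attempt 1's "with requests.Session() as sesh:" block, after logging in,
# to compare the session's view of the page with an anonymous one
logged_in_html = sesh.get(end_url).text
anonymous_html = requests.get(end_url).text

print(len(logged_in_html), len(anonymous_html))   # near-identical lengths would suggest the session isn't being recognised
print("list-item solutions" in logged_in_html)    # the divs I'm after
print("list-item solutions" in anonymous_html)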

Ash Ozen
  • The form also sends `utf8=%E2%9C%93` and `user[remember_me]=true` in the data payload. What happens if you send those along with the login request? – Cohan Jan 29 '20 at 17:37
  • If you mean like this `payload = { "user[email]": "mail@gmail.com", "user[password]": "pass", "authenticity_token": str(token), "utf8": "%E2%9C%93", "user[remember_me]": "true", }` It still yields no result of the correct div :/ May I ask where/how you found that? Edit: Same goes for attempt 2 method – Ash Ozen Jan 29 '20 at 17:45
  • Use your browser's developer tools (usually activated by hitting `F12`) and under the `Network` tab you can see the traffic for that page. If you open it before submitting the login form, you can see the headers sent with the login payload. You might also want to look at user agent settings, cookies, etc. to see if there's anything else you're missing in your login request. Some sites have some tricky protection against scraping, but with enough research, you can probably find a good solution. – Cohan Jan 29 '20 at 17:47
  • @Cohan Note that the page yielded after the post is the actual dashboard (which I hope is a sign of successful login?). Thank you for all the help so far! – Ash Ozen Jan 29 '20 at 18:10

0 Answers