Python Web-Scraping CSRF Token Issue

Question

I am using MechanicalSoup to log in to a website with Python 3.6, and I'm having issues with the CSRF token.

Every time I request the HTML back, I get "Invalid CSRF token: Forbidden". Searching the HTML of the login page, the closest match for an element id that looks like the token is "authenticity_token", which appears to be already filled in with the token.

I was able to use the "re" module to extract the token and resubmit it to the element with the id mentioned above, but no luck. Note that I had to find the element by id since it has no name attribute (this is why my RoboBrowser approach didn't work).

This is the element that I think corresponds to the CSRF token:

<input id="authenticity_token" type="hidden" value="b+csp/9zR/a1yfuPPIYJSiR0v8jJUTaJaGqJmJPmLmivSn4GtLgvek0nyPvcJ0aOgeo0coHpl94MuH/r1OK5UA==">

In this case I would extract "b+csp/9zR/a1yfuPPIYJSiR0v8jJUTaJaGqJmJPmLmivSn4GtLgvek0nyPvcJ0aOgeo0coHpl94MuH/r1OK5UA==" and resubmit it to that element.

Here is my code, with dummy values for the username, password, and URL:

import mechanicalsoup
import re

def return_token(str1):
    match1 = "authenticity_token"
    match2 = ".*value\=\"(.*)\".*"
    for x in range(len(str1)):
        line = str1[x]
        if re.findall(match1,line):
            token = re.findall(match2,line)[0]
            return token

url1 = ""
username = ""
password = ""

browser = mechanicalsoup.Browser()
page = browser.get(url1)
str0 = page.text
token = return_token(str0.split('\n'))
#print(str0)
form = page.soup.find("form",{"id":"loginForm"})

form.find('input', {'name': 'username'})['value'] = username
form.find('input', {'name': 'password'})['value'] = password
form.find('input', {'id': 'authenticity_token'})['value'] = str(token)

response = browser.submit(form, page.url)
print(response.text)

That regex is wrong by the looks of it - have you tried looking at your `token` and checking it's correct? Also - doesn't that library support getting the value from the form for you instead of using that approach? — Jon Clements, Oct 25 '17 at 22:12
@JonClements The regex does work, that's not the issue; it extracts the value inside the quotes just fine. I can't use RoboBrowser because it gets a form and then lets you enter values by the names of elements (this one doesn't have a name, so it is omitted). — Mets_CS11, Oct 25 '17 at 22:16
Looks like it's extracting everything up to the last quote it can find, not the nearest one... — Jon Clements, Oct 25 '17 at 22:18
I checked the value during a couple of runs and it looks right. I think the issue is that I'm not even sure whether I'm submitting to the right element, or whether this is the right way to go about it. — Mets_CS11, Oct 25 '17 at 22:24
Just had a very quick look at the library (literally a 2 minute read) - aren't you supposed to use `mechanicalsoup.Form` on the form element to create a form instance that you then populate and submit? That also looks like it'll take into account fields you don't override, such as hidden fields, and send them along with your request as well... Not sure though - haven't tried... but it looks like that's the way it's supposed to be used. (If so - you don't even need to worry about that field anymore) — Jon Clements, Oct 25 '17 at 22:26
Hmm, is there any way you know of to log in manually in Google Chrome, for example, and then use the open window (already logged in) to browse and export the HTML? I really only need to log in, go to a URL, and extract the data on the page. If I can avoid the login step and just do it once manually, that would be fine. — Mets_CS11, Oct 25 '17 at 22:49
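
On the regex point raised in the comments, here is a minimal sketch of the difference between the greedy pattern and a narrower one. The sample line and the shortened token value are made up for illustration; the real page may well put each tag on its own line with value as the last attribute, in which case the greedy pattern happens to work.

```python
import re

# Hypothetical sample line: the token input followed by a second tag on the
# same line. The token value is a shortened, made-up placeholder.
line = ('<input id="authenticity_token" type="hidden" value="abc123==">'
        '<input type="text" name="q">')

# Greedy pattern from the question: (.*) runs to the LAST double quote on the line.
greedy = re.findall(r'.*value="(.*)".*', line)[0]

# Narrower pattern: [^"]* stops at the first closing quote after value=".
narrow = re.search(r'value="([^"]*)"', line).group(1)

print(greedy)  # includes trailing markup from the second tag
print(narrow)  # -> abc123==
```

If the greedy version ever returns more than the token, switching to the `[^"]*` form (or reading the attribute from the parsed form instead of using a regex) avoids the problem.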

score 1 · Answer 1 · answered Oct 26 '17 at 01:35

I believe the issue here is that <input> elements must have a name attribute to be submitted via POST or GET. Since your token is in a name-less <input> element, MechanicalSoup does not include it in the request, which mirrors what a browser would do.

From the W3C HTML 4.01 specification:

Every successful control has its control name paired with its current value as part of the submitted form data set. A successful control must be defined within a FORM element and must have a control name.

...

A control's "control name" is given by its name attribute.

Perhaps there is some JavaScript on the page that handles the CSRF token instead.

For a similar discussion, see "Does form data still transfer if the input tag has no name?"
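
To make the "successful control" rule concrete, here is a rough sketch using only the standard library. It is not how MechanicalSoup is implemented; the NamedInputCollector class, the sample form, and its values are all hypothetical and exist only to show which inputs would end up in the submitted data set.

```python
from html.parser import HTMLParser


class NamedInputCollector(HTMLParser):
    """Split <input> elements into those with a name and those without."""

    def __init__(self):
        super().__init__()
        self.submitted = []  # (name, value) pairs a form submission would carry
        self.skipped = []    # ids of inputs that have no name attribute

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            self.submitted.append((name, attributes.get("value", "")))
        else:
            self.skipped.append(attributes.get("id"))


# Made-up login form modelled on the one described in the question.
sample_html = """
<form id="loginForm">
  <input name="username" value="alice">
  <input name="password" type="password" value="secret">
  <input id="authenticity_token" type="hidden" value="abc123==">
</form>
"""

collector = NamedInputCollector()
collector.feed(sample_html)
print(collector.submitted)  # only the named inputs
print(collector.skipped)    # the name-less token input
```

In this sketch the token input lands in `skipped`, so setting its value on the parsed form has no effect on what is sent to the server.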

Regarding your usage of MechanicalSoup, the StatefulBrowser and Form classes would simplify your script. For example, if you only had to open the page and enter a username and password:

import mechanicalsoup

# These values are filled by the user
url = ""
username = ""
password = ""

# Open the page
browser = mechanicalsoup.StatefulBrowser(raise_on_404=True)
browser.open(url)

# Fill in the form values
form = browser.select_form('form[id=loginForm]')
form['username'] = username
form['password'] = password

# Submit the form and print the resulting page text
response = browser.submit_selected()
print(response.text)
