Read page source before POST

Question

Is there a way to POST parameters after reading the page source? For example, reading a captcha value from the page before posting an ID number.

My current code:

import requests
id_number = "1"
url = "http://www.submitmyforum.com/page.php"
data = dict(id = id_number, name = 'Alex')
post = requests.post(url, data=data)

There is a captcha that changes after every request to http://submitforum.com/page.php (obviously not a real site). I would like to read that value and add it to the "data" variable.

Are you wondering how to get data from a webpage before loading that webpage? _I feel like I'm missing a good Schrödinger's cat joke..._ — Arount, Jun 14 '17 at 12:14
You can use a GET request, parse the page source, extract the captcha ID and process it, then send your POST request with the correct captcha data. It will not always work, though, depending on the captcha system in use. I ended up using web browser emulation (e.g. the Selenium Python bindings) for this kind of purpose, since it makes it easy to keep the same session. (I was doing captcha security analysis and auto-completion.) — Retsim, Jun 14 '17 at 12:17
@Arount I am trying to read the web page source so that I could grab a changeable value and add it to my data variable. — Jeremy Claus, Jun 14 '17 at 12:30
@Retsim Could you elaborate more on this? Wouldn't the get request give a different captcha than the post request? I assume I'll be doing 2 requests in this scenario? — Jeremy Claus, Jun 14 '17 at 12:31
@JeremyClaus Yes, you'll be doing 2 requests in this scenario. I can't see any other way (even when emulating a browser you will need two), though I may be wrong, because you need to parse the web page before sending anything. It's not always a problem, as some captcha systems do not change their challenge between requests (IP-based, time-based, session-based), which lets you adapt. If your captcha is different (e.g. Google reCAPTCHA), web browser emulation is the easiest way I know to achieve this without much effort. — Retsim, Jun 14 '17 at 12:37
@Retsim Alright, thanks mate. The problem here is not solving the captcha, but adding the captcha to the data variable. Care to point me to documentation for the method you mentioned in your last comment? — Jeremy Claus, Jun 14 '17 at 12:54
@JeremyClaus I added a code sample with some comments as a potential answer, taken from my Selenium captcha analysis script. Hope it helps! — Retsim, Jun 14 '17 at 13:10

score 0 · Accepted Answer · answered Jun 14 '17 at 13:09

As discussed in the comments on the question, Selenium can be used. Methods without browser emulation may also exist.

Using Selenium (http://selenium-python.readthedocs.io/) instead of the requests module:

import re
from selenium import webdriver
from selenium.webdriver.common.by import By

# Matches the part of the iframe URL between "k=" and "&co=" (the reCAPTCHA ID)
regexCaptcha = "k=.*&co="
url = "http://submitforum.com/page.php"

# Get to the URL
browser = webdriver.Chrome()
browser.get(url)

# Example of getting a page element (using CSS selectors)
# In this example, I'm getting the Google reCAPTCHA ID if present on the current page
try:
    element = browser.find_element(By.CSS_SELECTOR, 'iframe[src*="https://www.google.com/recaptcha/api2/anchor?k"]')
    captchaID = re.findall(regexCaptcha, element.get_attribute("src"))[0].replace("k=", "").replace("&co=", "")
    captchaFound = True
    print("Captcha found!", captchaID)
except Exception:
    # Raised when the iframe is missing or its URL does not match the regex
    print("No captcha found!")
    captchaFound = False

# Treat captcha
# --> Your treatment code (it must define captcha_answer)

# Enter the captcha response on the page
captchaResponse = browser.find_element(By.ID, 'captcha-response')
captchaResponse.send_keys(captcha_answer)

# Validate the form
validateButton = browser.find_element(By.ID, 'submitButton')
validateButton.click()

# --> Analysis of returned page if needed
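
For the approach without browser emulation mentioned in the comments (a GET request, parse the page, then a POST), here is a minimal sketch. It assumes the captcha value is exposed as a hidden form input and that the server ties it to a session cookie. The field name "captcha_id" and the helper names extract_hidden_fields and fetch_then_post are hypothetical, not something taken from the real site, so adjust them to the actual page source. It will not help with challenges such as Google reCAPTCHA that need a browser.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of hidden input names to values found in the HTML."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def fetch_then_post(session, url, form_data, token_field="captcha_id"):
    """GET the form, copy one hidden field into the payload, then POST it.

    token_field is a hypothetical field name (an assumption of this sketch).
    The same session object is used for both requests so cookies are kept.
    """
    page = session.get(url)
    page.raise_for_status()

    hidden = extract_hidden_fields(page.text)
    if token_field not in hidden:
        raise ValueError("hidden field %r not found in page" % token_field)

    payload = dict(form_data)
    payload[token_field] = hidden[token_field]
    return session.post(url, data=payload)


# Usage with the requests module (not executed here):
#
#   import requests
#   with requests.Session() as session:
#       response = fetch_then_post(
#           session,
#           "http://submitforum.com/page.php",
#           {"id": "1", "name": "Alex"},
#       )
```

Passing a requests.Session rather than calling requests.get and requests.post separately matters here: if the captcha is session-based, the POST has to carry the cookie that the GET received, otherwise the server may issue a different challenge.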

Exactly what I wanted. Thanks buddy. — Jeremy Claus, Jun 15 '17 at 07:39
