Good Evening,
I've been reading up on how GET and POST requests work with the requests module. I've made .get and .post requests before, but the concept is still new to me (as is web scraping). I've been trying to use https://www.aconvert.com/pdf/ to upload a PDF document and convert it to HTML from Python 2.7, but with no luck: I can't seem to find the correct parameters. The website also has an API for making requests (described at https://www.aconvert.com/api.html), but I don't really understand how it works. I've tried several things; my last attempt looked something like the code below:
import requests
pdf_file = r"PATH_TO_PDF.pdf"
session_requests = requests.session()
data = {'file': pdf_file, 'targetformat':'HTML'}
#r = session_requests.get('https://www.aconvert.com/pdf/')
out = session_requests.post('https://www.aconvert.com/pdf/', data)
print out.text
The output is just the HTML source of the page itself. Doing the conversion manually, by choosing a PDF file and a target format, produces a results area with a link to the HTML output. Clicking the link opens the converted file (for example: https://s2.aconvert.com/convert/p3r68-cdx67/cbzdr-c6wcd.html). Doing the same thing through the API section of the site quickly returns the HTML link and SUCCESS (as a dictionary).
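My guess is that part of the problem with the attempt above is that passing the path in the data dictionary only sends the path string as an ordinary form field, not the contents of the file. The following is a minimal offline sketch of that difference, using prepared requests so that nothing is actually sent. The file name example.pdf and the dummy PDF bytes are placeholders of mine, and I am not claiming that either request is what the site expects.

```python
import io

import requests

URL = 'https://www.aconvert.com/pdf/'  # only used to build the requests; nothing is sent

# Passing a dict as data: the path is sent as a plain, URL-encoded string field.
form_req = requests.Request(
    'POST', URL,
    data={'file': 'PATH_TO_PDF.pdf', 'targetformat': 'HTML'},
).prepare()

# Passing a file object as files: the bytes are sent as multipart/form-data.
# The name "example.pdf" and the dummy bytes below are placeholders.
dummy_pdf = io.BytesIO(b'%PDF-1.4 dummy content')
multipart_req = requests.Request(
    'POST', URL,
    files={'file': ('example.pdf', dummy_pdf)},
    data={'targetformat': 'HTML'},
).prepare()

print(form_req.headers['Content-Type'])
print(form_req.body)
print(multipart_req.headers['Content-Type'])
```

The first request carries only the text of the path, while the second carries the file bytes together with the extra form field in a multipart body.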
Any examples demonstrating how to upload a PDF file and extract the resulting HTML would be greatly appreciated, along with a short explanation so that I can better understand how it all works.
If further clarification is required, please let me know.
Thanks!
EDIT: I'm still trying to get my result. I'm making progress, but it still doesn't work. In case anyone can help, I now have the following code:
import requests

pdf_file = r'PATH_TO_PDF'
session_requests = requests.session()
# Extra form fields sent alongside the uploaded file
payload = {'targetformat': 'HTML', 'ocrlan': '0', 'filelocation': 'local'}
# Open the PDF in binary mode so its contents are uploaded as multipart/form-data
with open(pdf_file, 'rb') as f:
    return_out = session_requests.post('https://s2.aconvert.com/convert/convert-batch-win.php', files={'file': f}, data=payload)
print return_out.text
I at least get a different response from the website now, though it is still an error. This prints: {"result":"3-pdf-HTML--local-2","state":"ERROR"}.
I'm not sure exactly what I'm doing wrong now. I inspected the page source and monitored the live HTTP traffic with a tool. I have the header and payload information and can post it if needed.
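For reference, here is a rough sketch of how the JSON reply quoted above could be checked in code. The helper name parse_convert_response is my own, and I am assuming that a successful reply uses the same "state" and "result" keys with "state" set to "SUCCESS"; I have only seen the ERROR reply from this endpoint myself.

```python
import json


def parse_convert_response(text):
    """Return (ok, result) for a JSON reply shaped like the one quoted above.

    Assumption: a successful reply sets "state" to "SUCCESS" and puts the
    converted file's location in "result".
    """
    reply = json.loads(text)
    ok = reply.get('state') == 'SUCCESS'
    return ok, reply.get('result')


# The exact reply I currently get back from the server.
ok, result = parse_convert_response(
    '{"result":"3-pdf-HTML--local-2","state":"ERROR"}'
)
print('ok=%s result=%s' % (ok, result))
```

With the requests module, return_out.json() should give the same dictionary as json.loads(return_out.text).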