
Good Evening,

I've been researching how GET and POST requests work using the requests module. I've made .get and .post requests in the past, but it's still a brand new concept to me (along with web scraping). I've been trying to use https://www.aconvert.com/pdf/ to upload a PDF document and convert it to an HTML document with Python 2.7, but with no luck: I can't seem to get the correct parameters. The website also has an API for making requests (described at https://www.aconvert.com/api.html), but I don't really understand how it works. I've tried several things. My last attempt looked something like the code below:

    import requests

    pdf_file = r"PATH_TO_PDF.pdf"
    session_requests = requests.session()
    data = {'file': pdf_file, 'targetformat':'HTML'}
    #r = session_requests.get('https://www.aconvert.com/pdf/')
    out = session_requests.post('https://www.aconvert.com/pdf/', data)
    print out.text

The output just displays the HTML source of the page itself. Doing the conversion manually, by choosing a PDF file and a target format, produces a display area with a link to the HTML output. Clicking that link opens the result (such as: https://s2.aconvert.com/convert/p3r68-cdx67/cbzdr-c6wcd.html). Doing the same thing through the API section of the site quickly displays the HTML link and SUCCESS (returned as a dictionary).

Any examples demonstrating how to upload a PDF file and extract the resulting HTML would be greatly appreciated, along with a short explanation so that I can better understand how it all works.

If further clarification is required, please let me know.

Thanks!

EDIT: I'm still trying to get my result. I'm making progress, but it still doesn't work. In case anyone can help, I now have the following code:

    import requests

    pdf_file = r'PATH_TO_PDF'
    session_requests = requests.session()
    files = {'file': open(pdf_file, 'rb')}
    payload = {'targetformat': 'HTML', 'ocrlan': '0', 'filelocation': 'local'}
    return_out = session_requests.post('https://s2.aconvert.com/convert/convert-batch-win.php', files=files, data=payload)
    print return_out.text

I at least get a different ERROR response from the website. This prints: {"result":"3-pdf-HTML--local-2","state":"ERROR"}.

I'm not sure exactly what I'm doing wrong now. I inspected the page source and monitored the live HTTP traffic with a tool, so I have the header and payload information and can post it if needed.

Darican

1 Answer


I figured out the solution, in case someone has similar issues. I wasn't aware of all the different parameters that can be passed to .post and how they work, such as files=, data=, params=, headers=, setting redirects to true, etc. After some playing around, I got it working. My code now posts a PDF to the website and converts it to any chosen file type such as html, doc, docx, csv, and more. The beauty of it all is that the site is also able to show the hidden hyperlinks in the PDF.
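
For anyone else unsure how those keyword arguments differ, here is a minimal sketch that builds requests without sending them, so the effect of each argument can be inspected offline. The URL `https://example.com/convert`, the `build` helper, and the in-memory stand-in for a PDF are assumptions made up for illustration; they are not part of the aconvert site or of my working code.

```python
import io
import requests

# Hypothetical endpoint, used only to build a request; nothing is sent.
URL = "https://example.com/convert"

def build(**kwargs):
    """Prepare a POST request without sending it, so it can be inspected."""
    return requests.Request("POST", URL, **kwargs).prepare()

# data= alone: fields are form-encoded, so a file path is sent as plain text.
form_only = build(data={"file": "PATH_TO_PDF.pdf", "targetformat": "html"})

# files= plus data=: the body becomes multipart/form-data and carries the bytes.
fake_pdf = io.BytesIO(b"%PDF-1.4 placeholder bytes")  # stand-in for open(path, 'rb')
multipart = build(files={"file": ("sample.pdf", fake_pdf)},
                  data={"targetformat": "html"})

# params= ends up in the URL query string; headers= sets request headers.
with_query = build(params={"targetformat": "html"},
                   headers={"User-Agent": "demo"})

print(form_only.headers["Content-Type"])
print(multipart.headers["Content-Type"].split(";")[0])
print(with_query.url)
```

This is most likely why my first attempt only returned the page source: with data= alone, the server received the path as a string and never got the file contents. The code that worked against the actual site follows.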

    import requests

    session_request = requests.session()
    pdf_file = r'PATH_TO_PDF_FILE'

    # files= sends the PDF as a multipart upload; data= carries the plain form fields.
    with open(pdf_file, 'rb') as pdf:
        return_post = session_request.post('https://s2.aconvert.com/convert/api-win.php',
                                           files={'file': pdf}, data={'targetformat': 'html'})

    # The response is a short JSON string; index 3 of the quote-split text is the result URL.
    return_get = session_request.get(str(return_post.text).split('"')[3])
    print return_get.text
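
Splitting on quotes works but is fragile. A slightly more defensive sketch parses the response as JSON and checks the state first. It assumes a successful response has the same shape as the error shown in the question, i.e. a "result" key holding the URL and a "state" key equal to "SUCCESS"; I have not confirmed that against the API documentation. The `extract_result` helper is a hypothetical name, not something the site or requests provides.

```python
import json

def extract_result(response_text):
    """Return the 'result' field, or raise if the service did not report success."""
    payload = json.loads(response_text)
    # Assumed shape: {"result": "<url or error code>", "state": "SUCCESS" or "ERROR"}
    if payload.get("state") != "SUCCESS":
        raise ValueError("conversion failed: %s" % payload.get("result"))
    return payload["result"]
```

With this helper, the last request would become `session_request.get(extract_result(return_post.text))`, and an ERROR response raises an exception instead of being treated as a URL.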
Darican