Convert .doc/.docx to .pdf from URL, on-the-fly, with Python, on Linux

Question

I need to fetch .doc or .docx files from external sites, convert them to PDF, and return the content. I then add a content-type header, publish through my CMS, cache via CDN, and display the result within HTML using the Adobe PDF Embed API. I'm using Python 3.7.

As a test, this works:

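import os
import subprocess
from time import sleep

def generate_pdf():
    # soffice writes anyfilename.pdf into the current working directory
    subprocess.call(['soffice', '--convert-to', 'pdf',
                     'https://arbitrary.othersite.com/anyfilename.docx'])
    sleep(1)  # wait for the converted file to appear
    with open('anyfilename.pdf', 'rb') as myfile:
        content = myfile.read()
    os.remove('anyfilename.pdf')
    return content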

This would be nice:

def generate_pdf(url):
    result = subprocess.call(['soffice', '--convert-to', 'pdf', url])
    content = result
    return content

The URLs could include any parameters or illegal characters, which might make it hard to guess the resulting file name. Anyway, it would be preferable not to have to sleep, save, read, and delete the converted file.

Is this possible?

score 0 · Accepted Answer · answered Jun 23 '22 at 18:12

I don't think soffice supports writing to stdout, so you don't have many options. If you write to a temporary directory, though, you can use os.listdir to get the filename:

import os
import subprocess
import tempfile

url = "https://www.usariem.army.mil/assets/docs/journal/Lieberman_DS_survey_and_guidelines.docx"
# The directory and everything in it are deleted when the with block exits
with tempfile.TemporaryDirectory() as tmpdirname:
    subprocess.run(["soffice", "--convert-to", "pdf", "--outdir", tmpdirname, url], cwd="/")
    # soffice chooses the output name, so look it up instead of guessing it
    files = os.listdir(tmpdirname)
    if files:
        print(files[0])
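
To get closer to the generate_pdf(url) shape from the question, the same idea can be wrapped in a function that returns the PDF bytes. This is only a rough sketch: the `run` parameter is a hypothetical hook I've added so the conversion step can be swapped out when testing, and `check=True` and the FileNotFoundError are my own additions, not something soffice requires.

```python
import subprocess
import tempfile
from pathlib import Path


def generate_pdf(url, run=subprocess.run):
    """Convert the document at `url` to PDF and return the PDF bytes.

    `run` is a hypothetical injection point (not part of the original
    answer); by default it is subprocess.run, which invokes soffice.
    """
    with tempfile.TemporaryDirectory() as tmpdirname:
        # check=True (an addition) raises CalledProcessError if soffice fails
        run(
            ["soffice", "--convert-to", "pdf", "--outdir", tmpdirname, url],
            cwd="/",
            check=True,
        )
        # Keep only real PDFs; a lock file such as ".~lock.name.pdf#"
        # has the suffix ".pdf#" and is skipped by this filter.
        pdfs = [p for p in Path(tmpdirname).iterdir()
                if p.suffix.lower() == ".pdf"]
        if not pdfs:
            raise FileNotFoundError("soffice produced no PDF for " + url)
        # Read the bytes before the temporary directory is cleaned up
        return pdfs[0].read_bytes()
```

The file is read inside the with block, so nothing has to be removed by hand afterwards.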

davidli

Thanks. I guess as long as I've opted for soffice, I'll need to save and remove the file. This looks good for now. – Ken Jun 23 '22 at 21:28
I still needed to add sleep(0.6) - the minimum - after subprocess.run to get the file – Ken Jun 23 '22 at 21:55
What error do you get? The process should be done after `run` or `call` since they are blocking: https://colab.research.google.com/drive/1Ft-Ohc7scUlMVGmvnWdVVBJeShKKHGaM#scrollTo=203hcAC8lKlW – davidli Jun 23 '22 at 22:05
Using your code above with `return [str(files), str(tmpdirname)]`: no sleep gives `['[]', '/tmp/tmp7dymx1xq']`; sleep(0.5) gives `["['.~lock.ext_tor-mhpss-mhpss-psychologist-counsellor-jun22.pdf#']", '/tmp/tmpkgb2z0th']`; sleep(1) gives `["['ext_tor-mhpss-mhpss-psychologist-counsellor-jun22.pdf']", '/tmp/tmponhevdir']` – Ken Jun 24 '22 at 13:49
I was able to run it without problem on the colab.research.google.com you shared. So it's my system? – Ken Jun 24 '22 at 14:44
I'm not sure but I can't reproduce the issue. I would just do `sleep(0.1)` in a loop until the file appears without `~lock`. – davidli Jun 24 '22 at 15:02
It seems to work consistently - on my system - with sleep(0.6). Your tempfile solution was really what I needed. The resulting pdf is cached by Cloudflare once the parent page has been visited, so a few hundredths in the processing won't matter much. – Ken Jun 24 '22 at 15:12
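
For anyone who would rather not hard-code sleep(0.6), the polling idea from the comments could look roughly like the sketch below. The function name `wait_for_pdf` and its `timeout` and `interval` defaults are made up for illustration, and the `.~lock.` prefix check is based only on the lock file name shown in the output above.

```python
import time
from pathlib import Path


def wait_for_pdf(directory, timeout=10.0, interval=0.1):
    """Poll `directory` until a finished PDF is present and return its path.

    Returns None if no unlocked PDF shows up within `timeout` seconds.
    The name and defaults are illustrative, not taken from the thread.
    """
    deadline = time.monotonic() + timeout
    while True:
        entries = list(Path(directory).iterdir())
        # A ".~lock.<name>#" entry suggests the file is still being written
        locked = any(p.name.startswith(".~lock.") for p in entries)
        pdfs = [p for p in entries if p.suffix.lower() == ".pdf"]
        if pdfs and not locked:
            return pdfs[0]
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

It would be called right after subprocess.run, inside the with block, with tmpdirname as the directory.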
