I'm currently working on a Django web app that needs to retrieve several documents from a DMS, merge them into a single, large PDF, and return this file to the user.

The biggest issue in this process is that some of these files arrive in Python as doc/docx files. Typically I would use something like pythoncom and comtypes.client to convert these files to PDF before moving on to the merge, like so:

    import comtypes.client
    import pythoncom

    wdFormatPDF = 17  # Word's FileFormat constant for PDF
    pythoncom.CoInitialize()

    word = comtypes.client.CreateObject('Word.Application')
    word.Visible = False
    doc = word.Documents.Open(doc_path)  # doc_path: the retrieved doc file
    doc.SaveAs(pdf_path, FileFormat=wdFormatPDF)  # pdf_path: the newly created pdf file
    doc.Close()
    word.Quit()

However, this only works on a machine that has Microsoft Word installed. Since the app would ideally be running on an IIS server, this isn't really an option in my environment.

I have also been testing pypandoc with MiKTeX/XeLaTeX (which would still require external dependencies on the Windows server, but my options are starting to seem limited), like so:

    # doc_path: the retrieved doc file; pdf_path: the newly created PDF file
    output = pypandoc.convert_file(doc_path, 'pdf', outputfile=pdf_path)

While this creates the PDF, there are problems with the conversion. I can account for some by adding font settings to the extra arguments, but the doc files have images and some specific alignments that don't translate well.
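To make the "font settings in the extra arguments" part concrete, this is roughly the shape of the call. It is only a sketch: the helper names `build_pandoc_args` and `docx_to_pdf`, the default font `Calibri`, and the margin value are placeholders chosen for illustration, and it assumes a pandoc version that accepts `--pdf-engine` with XeLaTeX available on the PATH. As far as I know pandoc reads .docx but not legacy .doc input.

```python
def build_pandoc_args(main_font, engine="xelatex", margin="1in"):
    """Build the extra pandoc arguments for PDF output (hypothetical helper)."""
    return [
        f"--pdf-engine={engine}",             # LaTeX engine used to render the PDF
        "-V", f"mainfont={main_font}",        # font variable read by the XeLaTeX template
        "-V", f"geometry:margin={margin}",    # page margins passed to the geometry package
    ]


def docx_to_pdf(src_path, pdf_path, main_font="Calibri"):
    """Convert src_path to pdf_path via pypandoc (font name is an assumption)."""
    # Imported here so the argument builder works even without pypandoc installed.
    import pypandoc

    pypandoc.convert_file(
        src_path,
        "pdf",
        outputfile=pdf_path,
        extra_args=build_pandoc_args(main_font),
    )
    return pdf_path
```

This helps with fonts and margins only; it does nothing for the image placement and alignment problems described above.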

I'm also aware of ReportLab, although it seems designed more for generating a PDF from existing text than for porting a complete document, images and all.

Thus my question is: Is there a way to perform this conversion as cleanly as can be done with the Word.Application comtype, but without having Word installed? And if not, are there any other packages available that I've been unable to find or use properly?

pseudoku
  • `unoconv`, `imagemagick`? Are you able to install any extra stuff on the server, or are you looking for a pure Python way to do it? What sort of throughput (documents per second) needs to be converted? – jmunsch Apr 18 '18 at 19:38
  • I can consider installing some extra things on the server, my concern mainly stems from it being a web server that currently hosts many other apps as well, and thus some heavy testing would be required for anything larger (which is why I was testing latex and pandoc as an alternative to Microsoft Word). Thanks for the suggestions: As far as I was aware, unoconv would require Open Office, and thus be similar to using comtypes. Does imagemagick support document formats in addition to the images listed? – pseudoku Apr 18 '18 at 19:56
  • `convert`, which is part of ImageMagick, might be able to, with a `subprocess.Popen` type of call. See: https://github.com/ImageMagick/ImageMagick/blob/2273d6ae4312a5e38bc282216ea12fcbdc04b2ca/config/delegates.xml.in#L69 – jmunsch Apr 18 '18 at 19:58
  • But actually, upon further investigation, it looks like `convert` requires installing a DOCDecodeDelegate and DOCXDecodeDelegate, which is just the headless LibreOffice/OpenOffice executable. – jmunsch Apr 18 '18 at 20:05
  • You cannot use that approach even if you install Word, as server-side Office automation is not supported: https://support.microsoft.com/en-us/help/257757/considerations-for-server-side-automation-of-office The KB article also provides alternative methods, such as SharePoint Word Automation Services. There are also third-party solutions you can Google. – Lex Li Apr 18 '18 at 21:17
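
For reference, the headless LibreOffice route mentioned in the comments could be driven from Python with a plain `subprocess` call, roughly as below. This is a sketch under assumptions, not a tested solution: it assumes LibreOffice is installed and that its `soffice` executable is on the PATH (on Windows you would likely need to pass the full path to `soffice.exe`), and the function names and the 120-second timeout are made up for illustration.

```python
import os
import subprocess


def build_soffice_command(src_path, out_dir, soffice="soffice"):
    """Command line for a headless LibreOffice conversion to PDF."""
    return [soffice, "--headless", "--convert-to", "pdf", "--outdir", out_dir, src_path]


def expected_pdf_path(src_path, out_dir):
    """LibreOffice writes <source base name>.pdf into the output directory."""
    base = os.path.splitext(os.path.basename(src_path))[0]
    return os.path.join(out_dir, base + ".pdf")


def convert_with_libreoffice(src_path, out_dir, soffice="soffice", timeout=120):
    """Run the conversion; raises CalledProcessError or TimeoutExpired on failure."""
    subprocess.run(build_soffice_command(src_path, out_dir, soffice),
                   check=True, timeout=timeout)
    return expected_pdf_path(src_path, out_dir)
```

It still needs an office suite on the server, so it does not remove the dependency; it only swaps Word for something that can run headless.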
