I'm currently working on a Django web app that needs to retrieve several documents from a DMS, merge them into a single large PDF, and return that file to the user.
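For context, the merge step itself is the easy part. Below is a minimal sketch of what I have in mind, assuming the pypdf package is used for merging; the helper names `split_by_type` and `merge_pdfs` are hypothetical and just for illustration.

```python
from pathlib import Path

WORD_SUFFIXES = {".doc", ".docx"}


def split_by_type(paths):
    """Split paths into (ready PDFs, Word files still needing conversion).

    Files with any other extension are ignored in this sketch.
    """
    pdfs, word_files = [], []
    for path in map(Path, paths):
        suffix = path.suffix.lower()
        if suffix == ".pdf":
            pdfs.append(path)
        elif suffix in WORD_SUFFIXES:
            word_files.append(path)
    return pdfs, word_files


def merge_pdfs(pdf_paths, output_path):
    """Concatenate the given PDFs, in order, into a single file."""
    from pypdf import PdfWriter  # assumed dependency, imported lazily

    writer = PdfWriter()
    for path in pdf_paths:
        writer.append(str(path))
    writer.write(str(output_path))
    writer.close()
```

Anything that lands in the second list returned by `split_by_type` has to be converted to PDF before `merge_pdfs` can be called, and that is where I'm stuck.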
The biggest issue in this process is that some of these files arrive in doc/docx format. Typically I would use something like pythoncom and comtypes.client to convert these files to PDF before moving on to the merge, like so:
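import os
import comtypes.client
import pythoncom

wdFormatPDF = 17  # Word's WdSaveFormat value for PDF

# doc_path / pdf_path: the retrieved doc file and the newly created PDF file
pythoncom.CoInitialize()
word = comtypes.client.CreateObject('Word.Application')
word.Visible = False
try:
    doc = word.Documents.Open(os.path.abspath(doc_path))
    doc.SaveAs(os.path.abspath(pdf_path), FileFormat=wdFormatPDF)
    doc.Close()
finally:
    word.Quit()  # always release the Word instance, even if conversion fails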
However, this only works on a machine that has Microsoft Word installed. Since the app would ideally be running on an IIS server, this isn't really an option in my environment.
I also tested pypandoc with MiKTeX/XeLaTeX. This would still require external dependencies on the Windows Server, but my options are starting to seem limited. The call looks like this:
# doc_path: the retrieved doc file; pdf_path: the newly created PDF file
output = pypandoc.convert_file(doc_path, 'pdf', outputfile=pdf_path)
While this creates the PDF, the conversion is not faithful. I can account for some of the problems by adding font settings to the extra arguments, but the doc files contain images and some specific alignments that don't translate well.
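For reference, this is roughly how I pass the font settings. It is a sketch rather than my exact code: the helper names and the default font are placeholders, and it assumes a pandoc version that accepts `--pdf-engine` along with the `mainfont` template variable used by XeLaTeX.

```python
def build_font_args(main_font, pdf_engine="xelatex"):
    """Return pandoc CLI arguments that select the PDF engine and main font."""
    return [f"--pdf-engine={pdf_engine}", "-V", f"mainfont={main_font}"]


def convert_with_fonts(doc_path, pdf_path, main_font="Arial"):
    """Convert doc_path to a PDF at pdf_path via pandoc, passing font settings."""
    # Imported lazily; the call needs pandoc and a LaTeX engine installed.
    import pypandoc

    return pypandoc.convert_file(
        doc_path,
        "pdf",
        outputfile=pdf_path,
        extra_args=build_font_args(main_font),
    )
```

This fixes the fonts, but as far as I can tell no combination of extra arguments restores the image placement or alignment. I also believe pandoc only reads docx input, so legacy doc files would need yet another conversion step first.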
I'm also aware of ReportLab, although it seems designed for generating a PDF from text and drawing commands rather than porting a complete existing document, images and all.
Thus my question is: is there a way to perform this conversion as cleanly as the Word.Application COM object does, but without having Word installed? And if not, are there any other packages available that I've been unable to find or use properly?