Converting PDF to ODT/DOC using Apache OpenOffice

Question

I am using IronPython + PyFPDF to produce PDF reports that contain images, tables and text. Since PDF was never meant to be an editable, reflowable format, I wonder if there is a way to convert these reports into an editable format such as ODT or DOC while keeping the formatting intact as much as possible.

I have explored a couple of possible approaches:

PDF -> HTML -> Word (using pdf2htmlEX and pandas to get a DOC from the HTML, but pdf2htmlEX does not seem to preserve the document's formatting).
Using MS Word or Apache OpenOffice (depending on the server, and assuming the word processor is installed there) to do the conversion. Both can do it from the GUI, so there must be some way to do it from the command line, which I could then call from Python via subprocess to run the conversion programmatically.

I am open to exploring any third-party libraries or packages. The only restriction is that IronPython does not support packages that rely heavily on C code, such as docx-mailmerge, python-docx, numpy and pandas.

To sum up, the best option I see is to let Word or Apache OpenOffice Writer do the work, but I am not sure how to achieve that from the command line.
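What I have in mind is roughly the sketch below, which only uses the standard library so it should not hit the C-extension restriction. It is untested, and several things in it are assumptions on my part: the flags are the headless conversion flags documented for LibreOffice's soffice binary (I have not confirmed that Apache OpenOffice accepts the same ones), the binary is assumed to be on the PATH as soffice, and build_convert_command and convert_pdf are just names I made up for illustration.

```python
import os
import subprocess


def build_convert_command(pdf_path, out_dir, target="odt", soffice="soffice"):
    """Build the argument list for a headless PDF -> ODT/DOC conversion.

    Assumption: LibreOffice-style flags. Apache OpenOffice may need
    different options or a different mechanism altogether.
    """
    return [
        soffice,                          # assumed to be on the PATH
        "--headless",                     # run without showing the GUI
        "--infilter=writer_pdf_import",   # open the PDF in Writer rather than Draw
        "--convert-to", target,           # e.g. "odt" or "doc"
        "--outdir", out_dir,              # directory for the converted file
        pdf_path,
    ]


def convert_pdf(pdf_path, out_dir, target="odt", soffice="soffice"):
    """Run the conversion and return the path where the output is expected."""
    cmd = build_convert_command(pdf_path, out_dir, target, soffice)
    # check_call raises CalledProcessError if the process exits with an error.
    subprocess.check_call(cmd)
    # Assumption: the output keeps the input's base name with the new extension.
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, base + "." + target)
```

The idea would be to call convert_pdf("report.pdf", "out") right after PyFPDF writes the report. I kept the command building separate from the subprocess call so the argument list can be printed and checked without the office suite installed.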

Can anyone please point me in the right direction?
