2

All,

I am modifying a python script (using PyUno) that will read in MSword document (.docx) and convert it to xml. I have a script that will do everything I need here, except it will convert from doc to pdf. I cannot find a list of acceptable export formats for xml.

Any help will be appreciated.

Thanks!

:bp:

NotCharlie
  • 43
  • 2
  • 9
  • Clarification: The code referenced above uses: property name = "FilterName" and value as "writer_pdf_Export" -- what is the equivalent for an XML file? – NotCharlie Jan 05 '16 at 18:10

1 Answers1

0

These two FilterName values produce different flat XML formats:

  • OpenDocument Text Flat XML
  • MS Word 2003 XML

I found these names by doing this:

  1. Enabled macro recording by going to Tools -> Options -> Advanced, check "Enable Macro Recording".
  2. Tools -> Macros -> Record Macro.
  3. File -> Save As. Selected various options for the type.
  4. Named the macro, then checked the FilterName property in the resulting Basic code.

Keep in mind that .odt and .docx are also XML-based formats, only they are zipped up rather than flat. It is possible to parse files in these formats by doing something like this:

import os
import xml.dom.minidom
import xml.parsers.expat
import zipfile

filepath = "in.odt"  # or "in.docx"
tempDir = "path/to/temp/dir/"  # change according to your system
with zipfile.ZipFile(filepath, 'r') as zipper:
    zipper.extractall(tempDir)
try:
    dom = xml.dom.minidom.parse(os.path.join(tempDir, "content.xml"))
except xml.parsers.expat.ExpatError:
    # handle exception
Jim K
  • 12,824
  • 2
  • 22
  • 51